3. Data visualisation
3.2.4 Exercises
- Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
Nothing. The plot is empty because no layers have been added to ggplot().
- How many rows are in mpg? How many columns?
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
## 2 audi a4 1.80 1999 4 manual… f 21 29 p
## 3 audi a4 2.00 2008 4 manual… f 20 31 p
## 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
## 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
## 6 audi a4 2.80 1999 6 manual… f 18 26 p
## 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
## 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
## 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
## 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
234 rows and 11 columns in total.
- What does the drv variable describe? Read the help for ?mpg to find out.
drv describes the drive train: front-wheel drive (f), rear-wheel drive (r), or four-wheel drive (4).
- Make a scatterplot of hwy vs cyl.
ggplot(mpg, aes(x = hwy, y = cyl)) +
geom_point()
- What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg, aes(x = class, y = drv)) +
geom_point()
Scatterplots are useful for displaying two continuous variables (e.g., cty and hwy). class and drv are categorical, so the points pile up on a small grid of positions and the plot hides how many observations share each combination.
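A categorical-vs-categorical pairing can still be visualised; one option (not part of the original answer) is geom_count(), which sizes each point by the number of overlapping observations:

```r
library(ggplot2)

# Point area now reflects how many cars share each class/drv combination
ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()
```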
3.3.1 Exercises
What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
Inside aes(), "blue" is treated as a categorical variable with a single level, which is assigned the default (reddish) colour scale. To make the points blue, colour must be set manually (i.e., it must be located outside aes()):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
- Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.80 1999 4 auto(l… f 18 29 p
## 2 audi a4 1.80 1999 4 manual… f 21 29 p
## 3 audi a4 2.00 2008 4 manual… f 20 31 p
## 4 audi a4 2.00 2008 4 auto(a… f 21 30 p
## 5 audi a4 2.80 1999 6 auto(l… f 16 26 p
## 6 audi a4 2.80 1999 6 manual… f 18 26 p
## 7 audi a4 3.10 2008 6 auto(a… f 18 27 p
## 8 audi a4 quat… 1.80 1999 4 manual… 4 18 26 p
## 9 audi a4 quat… 1.80 1999 4 auto(l… 4 16 25 p
## 10 audi a4 quat… 2.00 2008 4 manual… 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cty, hwy
This information is printed below each variable name in the tibble output: <chr> indicates a character vector (categorical), while <int> and <dbl> indicate integer and double (continuous) variables.
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables? What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = displ))
For colour, a continuous variable produces a gradient rather than a discrete legend; for size, it produces smoothly varying point sizes. Mapping a continuous variable to shape throws an error, because a continuous variable cannot be represented by a discrete set of shapes. Mapping the same variable to multiple aesthetics works, but encodes the same information redundantly.
- What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
stroke controls the width of the border for shapes that have one (e.g., shapes 21–24). Shapes without a separate border are unaffected.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), stroke = 2, shape = 23)
- What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
The expression is evaluated first: each observation with displ < 5 falls into one group (TRUE), all remaining observations form a second group (FALSE), and the points are coloured by that logical result.
4. Workflow: basics
4.4 Practice
Why does this code not work?
my_variable <- 10
my_varıable
## Error in eval(expr, envir, enclos): object 'my_varıable' not found
Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)
There is a typo on the second line: the i in my_varıable is actually a dotless ı. It should be my_variable.
Tweak each of the following R commands so that they run correctly:
library(tidyverse)
ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
The fixes: dota should be data, fliter should be filter (and testing equality requires ==, not =), and diamond should be diamonds.
library(tidyverse)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
## # A tibble: 70 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a6 quatt… 4.20 2008 8 auto(… 4 16 23 p
## 2 chevrolet c1500 su… 5.30 2008 8 auto(… r 14 20 r
## 3 chevrolet c1500 su… 5.30 2008 8 auto(… r 11 15 e
## 4 chevrolet c1500 su… 5.30 2008 8 auto(… r 14 20 r
## 5 chevrolet c1500 su… 5.70 1999 8 auto(… r 13 17 r
## 6 chevrolet c1500 su… 6.00 2008 8 auto(… r 12 17 r
## 7 chevrolet corvette 5.70 1999 8 manua… r 16 26 p
## 8 chevrolet corvette 5.70 1999 8 auto(… r 15 23 p
## 9 chevrolet corvette 6.20 2008 8 manua… r 16 26 p
## 10 chevrolet corvette 6.20 2008 8 auto(… r 15 25 p
## # ... with 60 more rows, and 1 more variable: class <chr>
filter(diamonds, carat > 3)
## # A tibble: 32 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58. 8040 9.10 8.97 5.67
## 2 3.11 Fair J I1 65.9 57. 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56. 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58. 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56. 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62. 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53. 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58. 12300 9.44 9.40 5.85
## 9 3.22 Ideal I I1 62.6 55. 12545 9.49 9.42 5.92
## 10 3.50 Ideal H I1 62.8 57. 12587 9.65 9.59 6.03
## # ... with 22 more rows
- Press Alt + Shift + K. What happens? How can you get to the same place using the menus?
The Keyboard Shortcut Quick Reference overlay appears. The same reference can be opened from the menus via Tools > Keyboard Shortcuts Help.
5. Data transformation
5.2.4 Exercises
Find all flights that:
- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
Had an arrival delay of two or more hours:
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 811 630 101. 1047
## 2 2013 1 1 848 1835 853. 1001
## 3 2013 1 1 957 733 144. 1056
## 4 2013 1 1 1114 900 134. 1447
## 5 2013 1 1 1505 1310 115. 1638
## 6 2013 1 1 1525 1340 105. 1831
## 7 2013 1 1 1549 1445 64. 1912
## 8 2013 1 1 1558 1359 119. 1718
## 9 2013 1 1 1732 1630 62. 2028
## 10 2013 1 1 1803 1620 103. 2008
## # ... with 10,190 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Flew to Houston (IAH or HOU):
filter(flights, dest %in% c("IAH", "HOU"))
## # A tibble: 9,313 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 623 627 -4. 933
## 4 2013 1 1 728 732 -4. 1041
## 5 2013 1 1 739 739 0. 1104
## 6 2013 1 1 908 908 0. 1228
## 7 2013 1 1 1028 1026 2. 1350
## 8 2013 1 1 1044 1045 -1. 1352
## 9 2013 1 1 1114 900 134. 1447
## 10 2013 1 1 1205 1200 5. 1503
## # ... with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Were operated by United, American, or Delta:
filter(flights, carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 554 600 -6. 812
## 5 2013 1 1 554 558 -4. 740
## 6 2013 1 1 558 600 -2. 753
## 7 2013 1 1 558 600 -2. 924
## 8 2013 1 1 558 600 -2. 923
## 9 2013 1 1 559 600 -1. 941
## 10 2013 1 1 559 600 -1. 854
## # ... with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
(To find the two-letter carrier abbreviations, see airlines.)
Departed in summer (July, August, and September):
filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212. 236
## 2 2013 7 1 2 2359 3. 344
## 3 2013 7 1 29 2245 104. 151
## 4 2013 7 1 43 2130 193. 322
## 5 2013 7 1 44 2150 174. 300
## 6 2013 7 1 46 2051 235. 304
## 7 2013 7 1 48 2001 287. 308
## 8 2013 7 1 58 2155 183. 335
## 9 2013 7 1 100 2146 194. 327
## 10 2013 7 1 100 2245 135. 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Arrived more than two hours late, but didn’t leave late:
filter(flights, arr_delay >= 120 & dep_delay <= 0)
## # A tibble: 29 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 27 1419 1420 -1. 1754
## 2 2013 10 7 1350 1350 0. 1736
## 3 2013 10 7 1357 1359 -2. 1858
## 4 2013 10 16 657 700 -3. 1258
## 5 2013 11 1 658 700 -2. 1329
## 6 2013 3 18 1844 1847 -3. 39
## 7 2013 4 17 1635 1640 -5. 2049
## 8 2013 4 18 558 600 -2. 1149
## 9 2013 4 18 655 700 -5. 1213
## 10 2013 5 22 1827 1830 -3. 2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Were delayed by at least an hour, but made up over 30 minutes in flight:
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
## # A tibble: 1,844 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 2205 1720 285. 46
## 2 2013 1 1 2326 2130 116. 131
## 3 2013 1 3 1503 1221 162. 1803
## 4 2013 1 3 1839 1700 99. 2056
## 5 2013 1 3 1850 1745 65. 2148
## 6 2013 1 3 1941 1759 102. 2246
## 7 2013 1 3 1950 1845 65. 2228
## 8 2013 1 3 2015 1915 60. 2135
## 9 2013 1 3 2257 2000 177. 45
## 10 2013 1 4 1917 1700 137. 2135
## # ... with 1,834 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Departed between midnight and 6am (inclusive):
filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 9,363 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
It is a shortcut for finding values that fall within a specified range: between(x, left, right) is equivalent to x >= left & x <= right. For example, the “departed in summer” challenge becomes:
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 1 1 2029 212. 236
## 2 2013 7 1 2 2359 3. 344
## 3 2013 7 1 29 2245 104. 151
## 4 2013 7 1 43 2130 193. 322
## 5 2013 7 1 44 2150 174. 300
## 6 2013 7 1 46 2051 235. 304
## 7 2013 7 1 48 2001 287. 308
## 8 2013 7 1 58 2155 183. 335
## 9 2013 7 1 100 2146 194. 327
## 10 2013 7 1 100 2245 135. 337
## # ... with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
## # A tibble: 8,255 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
8,255 flights.
To see what other variables are missing, you can use View():
View(filter(flights, is.na(dep_time)))
dep_delay, arr_time, arr_delay, and air_time are also missing. These flights were most likely cancelled.
- Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
The general rule: an operation on NA returns a non-missing result whenever the answer would be the same for every value the NA could take. NA ^ 0 is 1 because anything to the power 0 is 1; NA | TRUE is TRUE and FALSE & NA is FALSE for the same reason. NA * 0 is the tricky case because if the missing value were Inf or -Inf, the product would be NaN rather than 0. See Jeffrey Arnold’s explanation for more detail.
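The rule is easy to check at the console:

```r
NA ^ 0      # 1: x ^ 0 is 1 for every x, so the NA cannot change the answer
NA | TRUE   # TRUE: the result is TRUE whatever the missing value is
FALSE & NA  # FALSE: the result is FALSE whatever the missing value is
NA * 0      # NA: if the missing value were Inf, Inf * 0 would be NaN, not 0
```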
5.3.1 Exercises
- How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
arrange(flights, desc(is.na(dep_time)))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 NA 1630 NA NA
## 2 2013 1 1 NA 1935 NA NA
## 3 2013 1 1 NA 1500 NA NA
## 4 2013 1 1 NA 600 NA NA
## 5 2013 1 2 NA 1540 NA NA
## 6 2013 1 2 NA 1620 NA NA
## 7 2013 1 2 NA 1355 NA NA
## 8 2013 1 2 NA 1420 NA NA
## 9 2013 1 2 NA 1321 NA NA
## 10 2013 1 2 NA 1545 NA NA
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301. 1242
## 2 2013 6 15 1432 1935 1137. 1607
## 3 2013 1 10 1121 1635 1126. 1239
## 4 2013 9 20 1139 1845 1014. 1457
## 5 2013 7 22 845 1600 1005. 1044
## 6 2013 4 10 1100 1900 960. 1342
## 7 2013 3 17 2321 810 911. 135
## 8 2013 6 27 959 1900 899. 1236
## 9 2013 7 22 2257 759 898. 121
## 10 2013 12 5 756 1700 896. 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 7 2040 2123 -43. 40
## 2 2013 2 3 2022 2055 -33. 2240
## 3 2013 11 10 1408 1440 -32. 1549
## 4 2013 1 11 1900 1930 -30. 2233
## 5 2013 1 29 1703 1730 -27. 1947
## 6 2013 8 9 729 755 -26. 1002
## 7 2013 10 23 1907 1932 -25. 2143
## 8 2013 3 30 2030 2055 -25. 2213
## 9 2013 3 2 1431 1455 -24. 1601
## 10 2013 5 5 934 958 -24. 1225
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- Sort flights to find the fastest flights.
This depends on how you define “fastest”: it could mean the least air_time, or the highest average speed (distance / air_time). Taking the least air_time:
arrange(flights, air_time)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 16 1355 1315 40. 1442
## 2 2013 4 13 537 527 10. 622
## 3 2013 12 6 922 851 31. 1021
## 4 2013 2 3 2153 2129 24. 2247
## 5 2013 2 5 1303 1315 -12. 1342
## 6 2013 2 12 2123 2130 -7. 2211
## 7 2013 3 2 1450 1500 -10. 1547
## 8 2013 3 8 2026 1935 51. 2131
## 9 2013 3 18 1456 1329 87. 1533
## 10 2013 3 19 2226 2145 41. 2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- Which flights travelled the longest? Which travelled the shortest?
If we assume this refers to distance travelled:
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3. 1516
## 2 2013 1 2 909 900 9. 1525
## 3 2013 1 3 914 900 14. 1504
## 4 2013 1 4 900 900 0. 1516
## 5 2013 1 5 858 900 -2. 1519
## 6 2013 1 6 1019 900 79. 1558
## 7 2013 1 7 1042 900 102. 1620
## 8 2013 1 8 901 900 1. 1504
## 9 2013 1 9 641 900 1301. 1242
## 10 2013 1 10 859 900 -1. 1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The longest flight is from JFK to HNL (Honolulu) at 4983 miles.
arrange(flights, distance)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 1 3 2127 2129 -2. 2222
## 3 2013 1 4 1240 1200 40. 1333
## 4 2013 1 4 1829 1615 134. 1937
## 5 2013 1 4 2128 2129 -1. 2218
## 6 2013 1 5 1155 1200 -5. 1241
## 7 2013 1 6 2125 2129 -4. 2224
## 8 2013 1 7 2124 2129 -5. 2212
## 9 2013 1 8 2127 2130 -3. 2304
## 10 2013 1 9 2126 2129 -3. 2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The shortest flight is from EWR (Newark) to LGA (La Guardia) at 17 miles; however, this appears to be a cancelled flight (several variables such as dep_time are missing). The next shortest flight is from EWR to PHL (Philadelphia) at 80 miles.
(See airports to find the names that correspond to the FAA airport code.)
5.4.1 Exercises
- Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2. 830 11.
## 2 533 4. 850 20.
## 3 542 2. 923 33.
## 4 544 -1. 1004 -18.
## 5 554 -6. 812 -25.
## 6 554 -4. 740 12.
## 7 555 -5. 913 19.
## 8 557 -3. 709 -14.
## 9 557 -3. 838 -8.
## 10 558 -2. 753 8.
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2. 830 11.
## 2 533 4. 850 20.
## 3 542 2. 923 33.
## 4 544 -1. 1004 -18.
## 5 554 -6. 812 -25.
## 6 554 -4. 740 12.
## 7 555 -5. 913 19.
## 8 557 -3. 709 -14.
## 9 557 -3. 838 -8.
## 10 558 -2. 753 8.
## # ... with 336,766 more rows
- What happens if you include the name of a variable multiple times in a select() call?
The duplicates are ignored: the variable appears in the output only once.
select(flights, dep_time, dep_time)
## # A tibble: 336,776 x 1
## dep_time
## <int>
## 1 517
## 2 533
## 3 542
## 4 544
## 5 554
## 6 554
## 7 555
## 8 557
## 9 557
## 10 558
## # ... with 336,766 more rows
What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
The one_of() function allows you to use variables stored in a character vector:
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2. 11.
## 2 2013 1 1 4. 20.
## 3 2013 1 1 2. 33.
## 4 2013 1 1 -1. -18.
## 5 2013 1 1 -6. -25.
## 6 2013 1 1 -4. 12.
## 7 2013 1 1 -5. 19.
## 8 2013 1 1 -3. -14.
## 9 2013 1 1 -3. -8.
## 10 2013 1 1 -2. 8.
## # ... with 336,766 more rows
It lets you select columns programmatically, which improves the readability of code when the same set of variables is reused in several places.
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
Select helpers are not case sensitive by default. To change the default, use ignore.case = FALSE.
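For example, making the helper case sensitive means "TIME" no longer matches the lower-case column names, so no columns are selected:

```r
library(dplyr)
library(nycflights13)

# With ignore.case = FALSE the match is case sensitive:
# "TIME" matches none of dep_time, sched_dep_time, arr_time, etc.
select(flights, contains("TIME", ignore.case = FALSE))
```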
5.5.2 Exercises
- Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100,
dep_time_mins = (hour * 60) + minute
)
## # A tibble: 336,776 x 4
## dep_time hour minute dep_time_mins
## <int> <dbl> <dbl> <dbl>
## 1 517 5. 17. 317.
## 2 533 5. 33. 333.
## 3 542 5. 42. 342.
## 4 544 5. 44. 344.
## 5 554 5. 54. 354.
## 6 554 5. 54. 354.
## 7 555 5. 55. 355.
## 8 557 5. 57. 357.
## 9 557 5. 57. 357.
## 10 558 5. 58. 358.
## # ... with 336,766 more rows
transmute(flights,
sched_dep_time,
hour = sched_dep_time %/% 100,
minute = sched_dep_time %% 100,
sched_dep_mins = (hour * 60) + minute
)
## # A tibble: 336,776 x 4
## sched_dep_time hour minute sched_dep_mins
## <int> <dbl> <dbl> <dbl>
## 1 515 5. 15. 315.
## 2 529 5. 29. 329.
## 3 540 5. 40. 340.
## 4 545 5. 45. 345.
## 5 600 6. 0. 360.
## 6 558 5. 58. 358.
## 7 600 6. 0. 360.
## 8 600 6. 0. 360.
## 9 600 6. 0. 360.
## 10 600 6. 0. 360.
## # ... with 336,766 more rows
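One edge case worth noting: midnight is recorded as 2400, which the formula above maps to 1440 minutes rather than 0. Taking the result modulo 1440 handles it; this is a small refinement, not part of the original answer:

```r
library(dplyr)
library(nycflights13)

# %% 1440 wraps dep_time == 2400 (midnight) around to 0 minutes since midnight
transmute(flights,
  dep_time,
  dep_time_mins = ((dep_time %/% 100) * 60 + dep_time %% 100) %% 1440
)
```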
- Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
transmute(flights, air_time, arr_time - dep_time)
## # A tibble: 336,776 x 2
## air_time `arr_time - dep_time`
## <dbl> <int>
## 1 227. 313
## 2 227. 317
## 3 160. 381
## 4 183. 460
## 5 116. 258
## 6 150. 186
## 7 158. 358
## 8 53. 152
## 9 140. 281
## 10 138. 195
## # ... with 336,766 more rows
You would expect arr_time - dep_time == air_time, but it isn’t, for two reasons. First, arr_time and dep_time are stored as HHMM integers, so subtracting them does not yield a number of minutes. Second, the times are recorded in local time zones that may differ between origin and destination, and air_time excludes time spent taxiing. To fix the first problem, convert both columns to minutes since midnight before subtracting.
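A sketch of that conversion (ignoring flights that cross midnight and any time-zone differences between origin and destination):

```r
library(dplyr)
library(nycflights13)

# Convert both HHMM-encoded times to minutes since midnight before subtracting
transmute(flights,
  air_time,
  flight_time = ((arr_time %/% 100) * 60 + arr_time %% 100) -
    ((dep_time %/% 100) * 60 + dep_time %% 100)
)
```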
- Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
select(flights, dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 x 3
## dep_time sched_dep_time dep_delay
## <int> <int> <dbl>
## 1 517 515 2.
## 2 533 529 4.
## 3 542 540 2.
## 4 544 545 -1.
## 5 554 600 -6.
## 6 554 558 -4.
## 7 555 600 -5.
## 8 557 600 -3.
## 9 557 600 -3.
## 10 558 600 -2.
## # ... with 336,766 more rows
You would expect dep_time - sched_dep_time == dep_delay, which holds once the HHMM-encoded times are converted to minutes.
- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
flights_ranked <- mutate(flights, rank = min_rank(-dep_delay))
arrange(flights_ranked, desc(dep_delay))
## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301. 1242
## 2 2013 6 15 1432 1935 1137. 1607
## 3 2013 1 10 1121 1635 1126. 1239
## 4 2013 9 20 1139 1845 1014. 1457
## 5 2013 7 22 845 1600 1005. 1044
## 6 2013 4 10 1100 1900 960. 1342
## 7 2013 3 17 2321 810 911. 135
## 8 2013 6 27 959 1900 899. 1236
## 9 2013 7 22 2257 759 898. 121
## 10 2013 12 5 756 1700 896. 1058
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, rank <int>
min_rank() handles ties like rank(x, ties.method = "min"): tied values share the lowest applicable rank, and subsequent ranks are skipped. To keep exactly the 10 most delayed flights (plus any ties), use filter(flights_ranked, rank <= 10).
- What does 1:3 + 1:10 return? Why?
1:3 + 1:10
## Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
## object length
## [1] 2 4 6 5 7 9 8 10 12 11
The shorter vector is recycled: 1:3 is repeated until it matches the length of 1:10, so the sums are 1 + 1 = 2, 2 + 2 = 4, 3 + 3 = 6, 4 + 1 = 5, 5 + 2 = 7, and so on. Because 10 is not a multiple of 3, R also issues a warning.
- What trigonometric functions does R provide?
cos(), sin(), and tan(), their inverses acos(), asin(), and atan(), plus the two-argument atan2(). See ?Trig.
5.6.7 Exercises
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
flights %>%
group_by(flight, origin, dest) %>%
filter(arr_delay == 10) %>%
summarise()
## # A tibble: 1,877 x 3
## # Groups: flight, origin [?]
## flight origin dest
## <int> <chr> <chr>
## 1 1 JFK FLL
## 2 3 EWR DEN
## 3 3 JFK LAX
## 4 3 JFK SJU
## 5 3 LGA MDW
## 6 4 JFK BUF
## 7 4 JFK SJU
## 8 5 EWR FLL
## 9 5 EWR SEA
## 10 6 JFK BUF
## # ... with 1,867 more rows
flights %>%
group_by(flight, origin, dest) %>%
summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE), count = n()) %>%
filter(avg_arr_delay == 10)
## # A tibble: 65 x 5
## # Groups: flight, origin [64]
## flight origin dest avg_arr_delay count
## <int> <chr> <chr> <dbl> <int>
## 1 3 EWR DEN 10. 23
## 2 102 LGA MKE 10. 1
## 3 171 EWR DEN 10. 5
## 4 252 EWR RSW 10. 2
## 5 265 EWR FLL 10. 2
## 6 271 EWR TPA 10. 1
## 7 279 LGA IAH 10. 1
## 8 299 EWR PHX 10. 4
## 9 316 EWR SFO 10. 1
## 10 363 EWR DFW 10. 1
## # ... with 55 more rows
Departure delay is probably more important for travellers.
- Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
not_cancelled %>%
group_by(dest) %>%
summarise(count = n())
## # A tibble: 104 x 2
## dest count
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
not_cancelled %>%
group_by(tailnum) %>%
summarise(distance_total = sum(distance))
## # A tibble: 4,037 x 2
## tailnum distance_total
## <chr> <dbl>
## 1 D942DN 3418.
## 2 N0EGMQ 239143.
## 3 N10156 109664.
## 4 N102UW 25722.
## 5 N103US 24619.
## 6 N104UW 24616.
## 7 N10575 139903.
## 8 N105UW 23618.
## 9 N107US 21677.
## 10 N108UW 32070.
## # ... with 4,027 more rows
- Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?
Flights that do not depart cannot arrive anywhere, so is.na(dep_delay) alone already captures cancellations. Flights that departed but have a missing arr_delay were not cancelled; they may have been diverted or otherwise failed to arrive as scheduled.
dep_delay is therefore the more important column for identifying cancelled flights, although which column matters most ultimately depends on the question being asked.
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>%
group_by(month, day) %>%
summarise(cancelled = sum(is.na(dep_delay)))
## # A tibble: 365 x 3
## # Groups: month [?]
## month day cancelled
## <int> <int> <int>
## 1 1 1 4
## 2 1 2 8
## 3 1 3 10
## 4 1 4 6
## 5 1 5 3
## 6 1 6 1
## 7 1 7 3
## 8 1 8 4
## 9 1 9 5
## 10 1 10 3
## # ... with 355 more rows
It is difficult to tell just from this output. Let’s visualise this instead:
flights %>%
mutate(cancelled = is.na(dep_delay)) %>%
group_by(month, day) %>%
summarise(
prop_cancelled = mean(cancelled),
avg_dep_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ggplot(aes(x = prop_cancelled, y = avg_dep_delay)) +
geom_point() +
geom_smooth(se = TRUE)
## `geom_smooth()` using method = 'loess'
There is a positive relationship between the proportion of cancelled flights and average delay. However, there are a few exceptional days with a high number of cancelled flights and relatively low average delay. If I had to guess, this might be related to bad weather (i.e., all flights were cancelled for a specific time period, so avg_dep_delay was close to zero).
- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()).)
Frontier Airlines (F9) has the worst delays:
not_cancelled %>%
group_by(carrier) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay))
## # A tibble: 16 x 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 20.2
## 2 EV 19.8
## 3 YV 18.9
## 4 FL 18.6
## 5 WN 17.7
## 6 9E 16.4
## 7 B6 13.0
## 8 VX 12.8
## 9 OO 12.6
## 10 UA 12.0
## 11 MQ 10.4
## 12 DL 9.22
## 13 AA 8.57
## 14 AS 5.83
## 15 HA 4.90
## 16 US 3.74
You cannot fully disentangle the effects of bad airports from bad carriers because destinations are unevenly distributed across carriers: a carrier that mostly serves delay-prone airports will look bad even if it is not at fault, and vice versa.
- What does the sort argument to count() do? When might you use it?
sort = TRUE sorts the output in descending order of n. It saves a separate arrange(desc(n)) step whenever you want the most common values first.
not_cancelled %>% count(dest, sort = TRUE)
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ATL 16837
## 2 ORD 16566
## 3 LAX 16026
## 4 BOS 15022
## 5 MCO 13967
## 6 CLT 13674
## 7 SFO 13173
## 8 FLL 11897
## 9 MIA 11593
## 10 DCA 9111
## # ... with 94 more rows
5.7.1 Exercises
- Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
When combined with grouping, functions operate within groups only, not the entire dataset. For example:
# What is the mean air time for each carrier?
flights %>%
group_by(carrier) %>%
summarise(mean_air_time = mean(air_time, na.rm = TRUE))
## # A tibble: 16 x 2
## carrier mean_air_time
## <chr> <dbl>
## 1 9E 86.8
## 2 AA 189.
## 3 AS 326.
## 4 B6 151.
## 5 DL 174.
## 6 EV 90.1
## 7 F9 230.
## 8 FL 101.
## 9 HA 623.
## 10 MQ 91.2
## 11 OO 83.5
## 12 UA 212.
## 13 US 88.6
## 14 VX 337.
## 15 WN 148.
## 16 YV 65.7
# What is the mean arrival delay for each month excluding all flights that were not delayed?
flights %>%
filter(arr_delay > 0) %>%
group_by(month) %>%
summarise(avg_arr_delay = mean(arr_delay))
## # A tibble: 12 x 2
## month avg_arr_delay
## <int> <dbl>
## 1 1 34.5
## 2 2 33.7
## 3 3 40.6
## 4 4 42.7
## 5 5 41.9
## 6 6 53.7
## 7 7 54.0
## 8 8 39.5
## 9 9 38.8
## 10 10 29.0
## 11 11 27.5
## 12 12 39.7
- Which plane (tailnum) has the worst on-time record?
N844MH.
flights %>%
group_by(tailnum) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay))
## # A tibble: 4,044 x 2
## tailnum avg_delay
## <chr> <dbl>
## 1 N844MH 320.
## 2 N911DA 294.
## 3 N922EV 276.
## 4 N587NW 264.
## 5 N851NW 219.
## 6 N928DN 201.
## 7 N7715E 188.
## 8 N654UA 185.
## 9 N665MQ 175.
## 10 N427SW 157.
## # ... with 4,034 more rows
- What time of day should you fly if you want to avoid delays as much as possible?
Morning, preferably before 9 am.
flights %>%
group_by(hour) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(avg_delay)## # A tibble: 20 x 2
## hour avg_delay
## <dbl> <dbl>
## 1 7. -5.30
## 2 5. -4.80
## 3 6. -3.38
## 4 9. -1.45
## 5 8. -1.11
## 6 10. 0.954
## 7 11. 1.48
## 8 12. 3.49
## 9 13. 6.54
## 10 14. 9.20
## 11 23. 11.8
## 12 15. 12.3
## 13 16. 12.6
## 14 18. 14.8
## 15 22. 16.0
## 16 17. 16.0
## 17 19. 16.7
## 18 20. 16.7
## 19 21. 18.4
## 20 1. NaN
- For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
flights %>%
group_by(dest) %>%
summarise(total_delay = sum(arr_delay, na.rm = TRUE))## # A tibble: 105 x 2
## dest total_delay
## <chr> <dbl>
## 1 ABQ 1113.
## 2 ACK 1281.
## 3 ALB 6018.
## 4 ANC -20.
## 5 ATL 190260.
## 6 AUS 14514.
## 7 AVL 2089.
## 8 BDL 2904.
## 9 BGR 2874.
## 10 BHM 4540.
## # ... with 95 more rows
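The second half of the question, each flight's share of its destination's total delay, is not computed above. A minimal sketch, keeping only positive delays so the proportions are meaningful (the column name prop_delay is my own choice):

```r
library(dplyr)
library(nycflights13)

# Share of each destination's total arrival delay attributable
# to each individual flight (positive delays only).
flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(dest, tailnum, arr_delay, prop_delay)
```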
- Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>%
group_by(year, month, day) %>%
mutate(lag_delay = lag(dep_delay)) %>%
ggplot(aes(x = dep_delay, y = lag_delay)) +
geom_point() +
geom_smooth()## `geom_smooth()` using method = 'gam'
## Warning: Removed 8620 rows containing non-finite values (stat_smooth).
## Warning: Removed 8620 rows containing missing values (geom_point).
- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
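No code was given for this question. One possible sketch compares each flight's air time to the fastest flight to the same destination; implausibly short air times pull the minimum down and make other flights to that destination look slow:

```r
library(dplyr)
library(nycflights13)

# Air time relative to the fastest flight to the same destination.
# Flights with a ratio of 1 are the fastest to each destination;
# inspect those with very small absolute air_time for data entry errors.
flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(air_time_ratio = air_time / min(air_time)) %>%
  arrange(air_time_ratio) %>%
  select(dest, origin, carrier, flight, air_time, air_time_ratio)
```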
- Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
flights %>%
group_by(dest) %>%
filter(n_distinct(carrier) >= 2) %>%
group_by(carrier) %>%
summarise(n_dest = n_distinct(dest)) %>%
arrange(desc(n_dest))
- For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>%
arrange(tailnum, time_hour) %>%
group_by(tailnum) %>%
mutate(big_delay = !is.na(dep_delay) & dep_delay > 60) %>%
filter(cumsum(big_delay) < 1) %>%
count(sort = TRUE)
6. Workflow: scripts
6.3 Practice
- Go to the RStudio Tips twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!
Ctrl + 1 (focus editor) and Ctrl + 2 (focus console) are my personal favourites.
- What other common mistakes will RStudio diagnostics report? Read https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics to find out.
Common mistakes such as:
- Warn if variable used has no definition in scope
- Warn if variable is defined but not used
- Style
Also, diagnostics for other languages such as C++, JavaScript, and Python.
7. Exploratory Data Analysis
7.3.4 Exercises
- Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = x), binwidth = 0.1) +
coord_cartesian(ylim = c(0, 500))
diamonds %>%
count(cut_width(x, 0.5))
## # A tibble: 16 x 2
## `cut_width(x, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 8
## 2 (3.25,3.75] 3
## 3 (3.75,4.25] 1834
## 4 (4.25,4.75] 12680
## 5 (4.75,5.25] 7502
## 6 (5.25,5.75] 6448
## 7 (5.75,6.25] 6031
## 8 (6.25,6.75] 9381
## 9 (6.75,7.25] 4193
## 10 (7.25,7.75] 3437
## 11 (7.75,8.25] 1620
## 12 (8.25,8.75] 699
## 13 (8.75,9.25] 79
## 14 (9.25,9.75] 18
## 15 (9.75,10.2] 6
## 16 (10.2,10.8] 1
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
diamonds %>%
count(cut_width(y, 0.5))
## # A tibble: 18 x 2
## `cut_width(y, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 7
## 2 (3.25,3.75] 6
## 3 (3.75,4.25] 1730
## 4 (4.25,4.75] 12566
## 5 (4.75,5.25] 7556
## 6 (5.25,5.75] 6272
## 7 (5.75,6.25] 6464
## 8 (6.25,6.75] 9382
## 9 (6.75,7.25] 4176
## 10 (7.25,7.75] 3425
## 11 (7.75,8.25] 1612
## 12 (8.25,8.75] 654
## 13 (8.75,9.25] 67
## 14 (9.25,9.75] 14
## 15 (9.75,10.2] 6
## 16 (10.2,10.8] 1
## 17 (31.8,32.2] 1
## 18 (58.8,59.2] 1
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.5)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = z), binwidth = 0.1) +
coord_cartesian(ylim = c(0, 50))
diamonds %>%
count(cut_width(z, 0.5))
## # A tibble: 16 x 2
## `cut_width(z, 0.5)` n
## <fct> <int>
## 1 [-0.25,0.25] 20
## 2 (0.75,1.25] 1
## 3 (1.25,1.75] 2
## 4 (1.75,2.25] 3
## 5 (2.25,2.75] 9276
## 6 (2.75,3.25] 13340
## 7 (3.25,3.75] 9572
## 8 (3.75,4.25] 13584
## 9 (4.25,4.75] 5589
## 10 (4.75,5.25] 2288
## 11 (5.25,5.75] 238
## 12 (5.75,6.25] 19
## 13 (6.25,6.75] 5
## 14 (6.75,7.25] 1
## 15 (7.75,8.25] 1
## 16 (31.8,32.2] 1
There are a number of outliers in each distribution that require further investigation.
As seen when you compute the bin widths and corresponding counts by hand, the most common value for x and y is between 4.25 and 4.75. For z, the most common value is between 3.75 and 4.25. The bin width from 2.75 to 3.25 is also very common. Based on these values, x and y appear to be the length and width of the diamond, while z is most likely the depth. This is confirmed in the documentation (?diamonds).
- Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 10)
There appear to be no diamonds priced around $1,300, as indicated by the gap in the histogram above. This is not consistent with the overall pattern.
- How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
count(carat) %>%
filter(carat %in% c(0.99, 1))## # A tibble: 2 x 2
## carat n
## <dbl> <int>
## 1 0.990 23
## 2 1.00 1558
There are 23 diamonds of 0.99 carat, compared with 1,558 diamonds of exactly 1 carat.
Maybe 0.99 carat diamonds are more likely to be rounded up.
diamonds %>%
filter(carat == 0.99) %>%
select(carat, cut, price, x, y, z)## # A tibble: 23 x 6
## carat cut price x y z
## <dbl> <ord> <int> <dbl> <dbl> <dbl>
## 1 0.990 Fair 2811 6.21 6.06 4.18
## 2 0.990 Fair 2812 6.72 6.67 3.68
## 3 0.990 Fair 2949 6.57 6.50 3.79
## 4 0.990 Fair 3337 6.42 6.34 3.87
## 5 0.990 Fair 3593 5.94 5.80 4.20
## 6 0.990 Very Good 4002 6.44 6.49 3.90
## 7 0.990 Good 4052 6.36 6.43 4.05
## 8 0.990 Premium 4075 6.45 6.38 3.89
## 9 0.990 Ideal 4763 6.40 6.42 3.96
## 10 0.990 Very Good 4780 6.30 6.33 3.90
## # ... with 13 more rows
diamonds %>%
filter(carat == 1) %>%
select(carat, cut, price, x, y, z) %>%
ggplot(mapping = aes(x = cut)) +
geom_bar()diamonds %>%
filter(carat == 1) %>%
select(price) %>%
summary(price)## price
## Min. : 1681
## 1st Qu.: 4155
## Median : 4864
## Mean : 5242
## 3rd Qu.: 6073
## Max. :16469
diamonds %>%
filter(carat == 1) %>%
select(carat, cut, price, x, y, z) %>%
ggplot(mapping = aes(x = price)) +
geom_histogram(binwidth = 250)
- Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
ylim(0, 50)
## Warning: Removed 11 rows containing missing values (geom_bar).
xlim() and ylim() throw away data outside the limits.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y)) +
coord_cartesian(ylim = c(0, 50))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y)) +
ylim(0, 50)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing missing values (geom_bar).
If binwidth is unset, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a better value, so the bars are drawn at an arbitrary default width.
If you try to zoom with xlim() or ylim() so only half a bar shows, the data in that binwidth will be thrown away. So, the bar will disappear from the plot.
7.4.1 Exercises
- What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
In a histogram, missing values are removed with a warning because a continuous axis has nowhere to place them. In a bar chart, NA is treated as just another category and gets its own bar. The difference arises because a categorical axis can accommodate an explicit NA level, while a continuous one cannot.
- What does na.rm = TRUE do in mean() and sum()?
It removes missing values before the computation proceeds.
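A quick illustration:

```r
x <- c(1, 2, NA)
mean(x)                # NA: a single missing value propagates
mean(x, na.rm = TRUE)  # 1.5
sum(x, na.rm = TRUE)   # 3
```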
7.5.1.1 Exercises
- Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_boxplot(mapping = aes(x = cancelled, y = sched_dep_time))
- What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Return to this question after completing the section on modelling. If I had to guess, I think carat might be the most important for predicting the price of a diamond. There is an exponential relationship between carat and price:
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
geom_point(alpha = 1/100)
- Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
ggstance provides flipped versions of geoms, stats, and positions. This means you can provide aesthetics in their natural order, making the code more readable:
ggplot(data = mpg) +
geom_boxploth(mapping = aes(x = hwy, y = reorder(class, hwy, FUN = median)))
In comparison, coord_flip() can only flip the plot as a whole.
- One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
ggplot(diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price))
The boxplots show many outliers, which is typical for large sample sizes (n > 1000).
ggplot(data = diamonds) +
geom_lv(mapping = aes(x = cut, y = price))
With letter-value plots, which are better suited to large samples (n > 1000), far fewer points are flagged as outliers. The widest box marks the middle half of the data (the IQR), and each successively narrower box covers half of the remaining tail.
- Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds) +
geom_violin(mapping = aes(x = cut, y = price))
ggplot(diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 1000) +
facet_wrap(~ cut)
geom_violin() displays everything in one plot, but it only shows the shape of the price distribution within each level of cut. In comparison, the facetted geom_histogram() shows the count in each bin, so it can be seen that far fewer diamonds have cut == Fair than, say, cut == Ideal.
- If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
ggbeeswarm provides two methods to reduce overplotting: geom_quasirandom and geom_beeswarm. Both methods show the density of the data at each point, while still showing each individual point.
ggplot(iris, aes(Species, Sepal.Length)) +
geom_point()
ggplot(iris, aes(Species, Sepal.Length)) +
geom_jitter()
ggplot(iris, mapping = aes(Species, Sepal.Length)) +
geom_quasirandom()
ggplot(iris, mapping = aes(Species, Sepal.Length)) +
geom_beeswarm()
geom_quasirandom() spaces points using a van der Corput sequence or Tukey texturing. In comparison, geom_beeswarm() uses the beeswarm library.
7.5.2.1 Exercises
- How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>%
count(cut, color)## # A tibble: 35 x 3
## cut color n
## <ord> <ord> <int>
## 1 Fair D 163
## 2 Fair E 224
## 3 Fair F 312
## 4 Fair G 314
## 5 Fair H 303
## 6 Fair I 175
## 7 Fair J 119
## 8 Good D 662
## 9 Good E 933
## 10 Good F 909
## # ... with 25 more rows
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
diamonds %>%
count(cut, color) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
- Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
group_by(month, dest) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = factor(month), y = dest)) +
geom_tile(mapping = aes(fill = avg_delay))
The plot is difficult to read because there are many destinations on the y-axis. There are also missing values that could be removed to improve readability.
- Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
The larger counts are located in the top left which makes it more readable.
8. Workflow: projects
No exercises.
10. Tibbles
10.5 Exercises
- How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
When printed, a tibble displays the header "# A tibble: <rows> x <cols>" at the top of the output. It also prints the column types and shows only the first ten rows.
as_tibble(mtcars)## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6. 160. 110. 3.90 2.62 16.5 0. 1. 4. 4.
## 2 21.0 6. 160. 110. 3.90 2.88 17.0 0. 1. 4. 4.
## 3 22.8 4. 108. 93. 3.85 2.32 18.6 1. 1. 4. 1.
## 4 21.4 6. 258. 110. 3.08 3.22 19.4 1. 0. 3. 1.
## 5 18.7 8. 360. 175. 3.15 3.44 17.0 0. 0. 3. 2.
## 6 18.1 6. 225. 105. 2.76 3.46 20.2 1. 0. 3. 1.
## 7 14.3 8. 360. 245. 3.21 3.57 15.8 0. 0. 3. 4.
## 8 24.4 4. 147. 62. 3.69 3.19 20.0 1. 0. 4. 2.
## 9 22.8 4. 141. 95. 3.92 3.15 22.9 1. 0. 4. 2.
## 10 19.2 6. 168. 123. 3.92 3.44 18.3 1. 0. 4. 4.
## # ... with 22 more rows
- Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
df <- tibble(abc = 1, xyz = "a")
df$x## Warning: Unknown or uninitialised column: 'x'.
## NULL
df[, "xyz"]## # A tibble: 1 x 1
## xyz
## <chr>
## 1 a
df[, c("abc", "xyz")]## # A tibble: 1 x 2
## abc xyz
## <dbl> <chr>
## 1 1. a
The default data frame uses partial matching. In comparison, tibbles do not. Data frames also store strings as factors by default.
df <- data.frame(abc = 1, xyz = "a")
glimpse(df)## Observations: 1
## Variables: 2
## $ abc <dbl> 1
## $ xyz <fct> a
- If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
By using the double bracket, [[. For example, df[[var]].
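For example, with mtcars (any data frame or tibble works the same way):

```r
library(tibble)

var <- "mpg"
tb <- as_tibble(mtcars)
tb[[var]]  # returns the mpg column as a plain numeric vector
```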
- Practice referring to non-syntactic names in the following data frame by:
- Extracting the variable called 1.
- Plotting a scatterplot of 1 vs 2.
- Creating a new column called 3 which is 2 divided by 1.
- Renaming the columns to one, two and three.
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
Extracting the variable called 1:
annoying$`1`## [1] 1 2 3 4 5 6 7 8 9 10
Plotting a scatterplot of 1 vs 2:
ggplot(annoying, aes(`1`, `2`)) +
geom_point()
Creating a new column called 3 which is 2 divided by 1:
annoying <- annoying %>%
mutate(`3` = `2` / `1`)
Renaming the columns to one, two and three:
annoying <- annoying %>%
rename(`one` = `1`,
`two` = `2`,
`three` = `3`)- What does
tibble::enframe()do? When might you use it?
It converts a named atomic vector or list into a two-column data frame, with columns name and value.
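It is handy when a summary comes back as a named vector and you want to pipe it into dplyr or ggplot2:

```r
library(tibble)

# Named vector -> two-column tibble (name <chr>, value <dbl>),
# one row per element.
enframe(c(a = 1, b = 2, c = 3))
```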
- What option controls how many additional column names are printed at the footer of a tibble?
n_extra from print.tbl(). For example:
print(flights, n_extra = 1)## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # …
11. Data import
11.2.2 Exercises
- What function would you use to read a file where fields were separated with “|”?
read_delim(), with the delim argument set to "|".
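For example:

```r
library(readr)

# Inline pipe-delimited data; readr treats the literal string as the file.
read_delim("x|y\n1|2\n3|4", delim = "|")
```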
- Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
Some other arguments they have in common: col_names, col_types, locale, and na. In fact, it appears they share the exact same arguments.
- What are the most important arguments to read_fwf()?
col_positions, which you typically construct with the fwf_widths() or fwf_positions() helpers.
- Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
x <- "x,y\n1,'a,b'"
df <- read_delim(x, ",", quote = "'")
- Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
Let’s fix these:
read_csv("a,b,c\n1,2,3\n4,5,6")## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 3
## 2 4 5 6
read_csv("a,b,c\n1,2,1\n2,3,4")## # A tibble: 2 x 3
## a b c
## <int> <int> <int>
## 1 1 2 1
## 2 2 3 4
read_csv("a,b\n1")## # A tibble: 1 x 2
## a b
## <int> <chr>
## 1 1 <NA>
It’s unclear what the intent was here:
read_csv("a,b\n1,2\na,b")## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
read_csv("a,b\n1,3")## # A tibble: 1 x 2
## a b
## <int> <int>
## 1 1 3
11.3.5 Exercises
- What are the most important arguments to locale()?
date_names and maybe decimal_mark.
- What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(decimal_mark = ",", grouping_mark = ",")## Error: `decimal_mark` and `grouping_mark` must be different
It produces an error.
locale(decimal_mark = ",")## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
## Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
## (May), June (Jun), July (Jul), August (Aug), September
## (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
The default value for grouping_mark is changed to “.”.
locale(grouping_mark = ".")## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
## Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
## (May), June (Jun), July (Jul), August (Aug), September
## (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
The default value for decimal_mark is changed to “,”.
- I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
date_format and time_format allow you to specify the default date and time formats used for parsing.
str(parse_guess("01/13/2018", locale = locale(date_format = "%m/%d/%Y")))## Date[1:1], format: "2018-01-13"
According to vignette("readr"), time_format isn’t currently used for anything.
- If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
locale(date_names = "en", date_format = "%d/%m/%Y", decimal_mark = ".",
grouping_mark = ",", tz = "Australia/Perth")## <locale>
## Numbers: 123,456.78
## Formats: %d/%m/%Y / %AT
## Timezone: Australia/Perth
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
## Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
## (May), June (Jun), July (Jul), August (Aug), September
## (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
- What’s the difference between read_csv() and read_csv2()?
read_csv2() uses a semicolon (“;”) as the field separator instead of a comma (“,”).
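read_csv2() also defaults to a comma as the decimal mark, matching the continental European convention of semicolon-separated files:

```r
library(readr)

# Semicolon-separated fields with comma decimal marks.
read_csv2("a;b\n1,5;2,25")  # parses a = 1.5, b = 2.25
```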
- What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
UTF-8 is the standard worldwide. In Europe, ISO 8859-1 is a common encoding. Shift JIS is a common encoding in Asia.
- Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
str(parse_date(d1, format = "%B %d, %Y"))
## Date[1:1], format: "2010-01-01"
str(parse_date(d2, format = "%Y-%b-%d"))## Date[1:1], format: "2015-03-07"
str(parse_date(d3, format = "%d-%b-%Y"))## Date[1:1], format: "2017-06-06"
str(parse_date(d4, format = "%B %d (%Y)"))## Date[1:2], format: "2015-08-19" "2015-07-01"
str(parse_date(d5, format = "%m/%d/%y"))## Date[1:1], format: "2014-12-30"
str(parse_time(t1, format = "%H%M"))## Classes 'hms', 'difftime' atomic [1:1] 61500
## ..- attr(*, "units")= chr "secs"
str(parse_time(t2, format = "%H:%M:%OS %p"))## Classes 'hms', 'difftime' atomic [1:1] 83710
## ..- attr(*, "units")= chr "secs"
12. Tidy data
12.2.1 Exercises
- Using prose, describe how the variables and observations are organised in each of the sample tables.
In table1, each variable has its own column, each observation its own row, and each value its own cell. Therefore, table1 is a tidy dataset.
In table2 the cases and population do not have their own columns, but rather, are treated as values. table3 does not have the cases and population variables stored separately. Instead, both are stored as part of a calculation under rate. table4a stores the values of the year variable as separate variables (with their own columns), with the values being the number of cases (population is missing). table 4b is similar, but uses population for the values rather than cases.
- Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
- Extract the number of TB cases per country per year.
- Extract the matching population per country per year.
- Divide cases by population, and multiply by 10000.
- Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
Extract the number of TB cases per country per year:
table2_cases <- filter(table2, type == "cases")[["count"]]
Extract the matching population per country per year:
table2_population <- filter(table2, type == "population")[["count"]]
Divide cases by population, and multiply by 10000:
table2_rate <- table2_cases / table2_population * 10000
Store back in the appropriate place:
table2_country <- filter(table2, type == "cases")[["country"]]
table2_year <- filter(table2, type == "cases")[["year"]]
table2_new <- tibble(
country = table2_country,
year = table2_year,
cases = table2_cases,
population = table2_population,
rate = table2_rate
)
table2_new## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
For table4a and table4b:
table4_new <- tibble(
country = table4a[["country"]],
`1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
`2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4_new## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 0.373 1.29
## 2 Brazil 2.19 4.61
## 3 China 1.67 1.67
- Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
First, filter table2 for the cases rows:
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count, group = country, colour = country)) +
geom_line() +
geom_point()
12.3.3 Exercises
- Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
(Hint: look at the variable types and think about column names.)
- Both spread() and gather() have a convert argument. What does it do?
gather() and spread() are not perfectly symmetrical because variable type is not retained when using both functions, as seen in the example above. In stocks, year is numeric:
stocks %>% select(year)## # A tibble: 4 x 1
## year
## <dbl>
## 1 2015.
## 2 2015.
## 3 2016.
## 4 2016.
However, after using the spread() and gather() functions it becomes character:
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`) %>%
select(year)## # A tibble: 4 x 1
## year
## <chr>
## 1 2015
## 2 2015
## 3 2016
## 4 2016
The convert argument, when set to TRUE, runs type conversion on the newly created columns, so a year column that became character after spreading and gathering is converted back to an appropriate type (integer here).
- Why does this code fail?
table4a %>% gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n
“1999” and “2000” are non-syntactic names so they have to be surrounded by backticks.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
- Why does spreading this tibble fail? How could you add a new column to fix the problem?
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people %>% spread(key, value)
## Error: Duplicate identifiers for rows (1, 3)
Phillip Woods’ age has been entered twice. You could add a new column that indicates the time of observation:
people <- tribble(
~name, ~key, ~value, ~time,
#-----------------|--------|------|-------
"Phillip Woods", "age", 45, 1,
"Phillip Woods", "height", 186, 1,
"Phillip Woods", "age", 50, 2,
"Jessica Cordero", "age", 37, 1,
"Jessica Cordero", "height", 156, 1
)
people %>% spread(key, value)
## # A tibble: 3 x 4
## name time age height
## <chr> <dbl> <dbl> <dbl>
## 1 Jessica Cordero 1. 37. 156.
## 2 Phillip Woods 1. 45. 186.
## 3 Phillip Woods 2. 50. NA
- Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
It needs to be gathered. The variables are sex (male or female) and pregnant (yes or no).
preg %>% gather(
male, female,
key = "sex",
value = "count"
)## # A tibble: 4 x 3
## pregnant sex count
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 no male 20.
## 3 yes female 10.
## 4 no female 12.
12.4.3 Exercises
- What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
extra controls what happens when you use a character vector in sep and there are too many pieces.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
In comparison, fill controls what happens when there are not enough pieces.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "left")## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
- Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?
It removes the input column from the data frame. By setting this to FALSE, you can retain the original input column in the dataset:
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>%
separate(x, c("one", "two", "three"), remove = FALSE)
## # A tibble: 3 x 4
## x one two three
## <chr> <chr> <chr> <chr>
## 1 a,b,c a b c
## 2 d,e,f d e f
## 3 g,h,i g h i
- Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>%
separate(x, c("one", "two", "three"))## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 g h i
tibble(x = c("a,b,c", "d,e,f", "g,h,i"))## # A tibble: 3 x 1
## x
## <chr>
## 1 a,b,c
## 2 d,e,f
## 3 g,h,i
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>%
extract(x, "A")## # A tibble: 3 x 1
## A
## <chr>
## 1 a
## 2 d
## 3 g
separate() pulls one column apart into many, splitting either by a separator or by position. extract() does the same job but uses a regular expression with capturing groups: each captured group becomes a new column, and everything else is discarded.
unite() only requires one variation of separation because it is combining multiple columns into one column. Therefore, there is only one way to do it (using sep). In comparison, there are multiple ways to split a character string for separate() (e.g., by position or by separator).
12.5.1 Exercises
- Compare and contrast the fill arguments to spread() and complete().
For spread(), fill is a single value that replaces every missing cell created by reshaping. For complete(), fill is a named list, so you can supply a different replacement value for each column.
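A sketch of the difference, using toy data of my own:

```r
library(tidyr)
library(tibble)
library(dplyr)

df <- tibble(
  year   = c(2015, 2015, 2016),
  half   = c(1, 2, 1),
  return = c(1.88, 0.59, 0.92)
)

# spread(): one scalar fills every hole created by reshaping
df %>% spread(year, return, fill = 0)

# complete(): a named list, one fill value per column
df %>% complete(year, half, fill = list(return = 0))
```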
- What does the direction argument to fill() do?
It sets the direction in which missing values are filled: "down" (the default, carrying the previous value forward) or "up". For example:
df <- data.frame(Month = 1:12, Year = c(NA, 2000, rep(NA, 10)))
df %>% fill(Year, .direction = "up")
## Month Year
## 1 1 2000
## 2 2 2000
## 3 3 NA
## 4 4 NA
## 5 5 NA
## 6 6 NA
## 7 7 NA
## 8 8 NA
## 9 9 NA
## 10 10 NA
## 11 11 NA
## 12 12 NA
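For comparison, the default direction ("down") carries the last non-missing value forward instead:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Month = 1:12, Year = c(NA, 2000, rep(NA, 10)))

# With "down", month 1 stays NA and months 3-12 inherit 2000
df %>% fill(Year, .direction = "down")
```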
12.6.1 Exercises
- In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
I think this is reasonable. Missing values appear to be represented explicitly in this dataset:
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases") %>%
count(is.na(cases))## # A tibble: 2 x 2
## `is.na(cases)` n
## <lgl> <int>
## 1 FALSE 76046
## 2 TRUE 329394
There are 329,394 NA values for cases (and 76,046 non-missing values).
It is difficult to determine whether there are implicit missing values in a dataset because, as stated earlier in the chapter, implicit missing values are an “absence of a presence”.
NA indicates an explicit missing value. So, the number of actual cases for an NA observation is not necessarily zero. In comparison, a value of zero indicates there were no cases for a particular observation.
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases") %>%
filter(cases == 0) %>%
nrow()## [1] 11080
- What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 76,046 x 6
## country year var sex age value
## * <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # ... with 76,036 more rows
The newrel strings won’t be split properly, because they contain only one underscore instead of two.
split_problem <- who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
split_problem %>% slice(73467:73486)## # A tibble: 20 x 6
## country year var sex age value
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 2013 m014 <NA> <NA> 1705
## 2 Albania 2013 m014 <NA> <NA> 14
## 3 Algeria 2013 m014 <NA> <NA> 25
## 4 Andorra 2013 m014 <NA> <NA> 0
## 5 Angola 2013 m014 <NA> <NA> 486
## 6 Anguilla 2013 m014 <NA> <NA> 0
## 7 Antigua and Barbuda 2013 m014 <NA> <NA> 1
## 8 Argentina 2013 m014 <NA> <NA> 462
## 9 Armenia 2013 m014 <NA> <NA> 25
## 10 Australia 2013 m014 <NA> <NA> 28
## 11 Austria 2013 m014 <NA> <NA> 15
## 12 Azerbaijan 2013 m014 <NA> <NA> 125
## 13 Bahamas 2013 m014 <NA> <NA> 3
## 14 Bahrain 2013 m014 <NA> <NA> 4
## 15 Bangladesh 2013 m014 <NA> <NA> 2323
## 16 Barbados 2013 m014 <NA> <NA> 0
## 17 Belarus 2013 m014 <NA> <NA> 4
## 18 Belgium 2013 m014 <NA> <NA> 40
## 19 Belize 2013 m014 <NA> <NA> 1
## 20 Benin 2013 m014 <NA> <NA> 25
For relapse cases, the combined sexage value (e.g., m014) has not been separated: it ends up under var, and the sex and age variables are NA.
- I claimed that iso2 and iso3 were redundant with country. Confirm this claim.
For each country, there is only one unique combination of iso2 and iso3:
who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
select(country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)## # A tibble: 0 x 3
## # Groups: country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
- For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.
who_cases <- who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
group_by(country, year, sex) %>%
summarise(cases = sum(value))
who_cases## # A tibble: 6,921 x 4
## # Groups: country, year [?]
## country year sex cases
## <chr> <int> <chr> <int>
## 1 Afghanistan 1997 f 102
## 2 Afghanistan 1997 m 26
## 3 Afghanistan 1998 f 1207
## 4 Afghanistan 1998 m 571
## 5 Afghanistan 1999 f 517
## 6 Afghanistan 1999 m 228
## 7 Afghanistan 2000 f 1751
## 8 Afghanistan 2000 m 915
## 9 Afghanistan 2001 f 3062
## 10 Afghanistan 2001 m 1577
## # ... with 6,911 more rows
who_cases %>%
group_by(sex, year) %>%
filter(year > 1995) %>%
summarise(cases_total = sum(cases)) %>%
ggplot(aes(x = year, y = cases_total, group = sex, colour = sex)) +
  geom_line()
13. Relational data
13.2.1 Exercises
- Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?
From flights, you would need origin and dest. From airports, lat, lon (the locations of the airport), and maybe alt would be required. You would need to combine the flights and airports tables to get the origin airport and destination airport.
The weather table may also be helpful as wind speed, wind direction, and visibility may affect the route.
- I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?
weather connects to airports via origin (the location). In the diagram, there should be an arrow pointing from faa in the airports table to origin in the weather table.
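A sketch of that relationship as a join:

```r
library(dplyr)
library(nycflights13)

# weather's origin is a foreign key referencing airports' primary key faa
weather %>%
  left_join(airports, by = c("origin" = "faa"))
```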
- weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?
It would define a relation with flights via the destination airport: dest combined with year, month, day, and hour.
- We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?
You would have date and name variables. The primary key of this table would be date. It would match to year, month, and day in other tables such as flights.
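A sketch of such a table (the holidays listed here are only illustrative):

```r
library(tibble)

# `date` is the primary key; split into year/month/day it joins to flights
special_days <- tribble(
  ~date,                 ~name,
  as.Date("2013-07-04"), "Independence Day",
  as.Date("2013-11-28"), "Thanksgiving",
  as.Date("2013-12-25"), "Christmas Day"
)
```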
13.3.1 Exercises
- Add a surrogate key to flights.
flights %>%
arrange(year, month, day, sched_dep_time) %>%
mutate(flight_id = row_number())## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 558 -4. 740
## 6 2013 1 1 559 559 0. 702
## 7 2013 1 1 554 600 -6. 812
## 8 2013 1 1 555 600 -5. 913
## 9 2013 1 1 557 600 -3. 709
## 10 2013 1 1 557 600 -3. 838
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, flight_id <int>
Identify the keys in the following datasets:
- Lahman::Batting
- babynames::babynames
- nasaweather::atmos
- fueleconomy::vehicles
- ggplot2::diamonds
(You might need to install some packages and read some documentation.)
Lahman::Batting:
The primary keys for Lahman::Batting are playerID, yearID, and stint.
Lahman::Batting %>%
count(playerID, yearID, stint) %>%
filter(n > 1)## # A tibble: 0 x 4
## # ... with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>
playerID is a foreign key (it is the primary key for Lahman::Master).
babynames::babynames:
The primary keys for babynames::babynames are year, sex, and name.
babynames::babynames %>%
group_by(year, sex, name) %>%
filter(n() > 1)## # A tibble: 0 x 5
## # Groups: year, sex, name [0]
## # ... with 5 variables: year <dbl>, sex <chr>, name <chr>, n <int>,
## # prop <dbl>
nasaweather::atmos:
For nasaweather::atmos, the primary keys are lat, long, year, and month.
nasaweather::atmos %>%
group_by(lat, long, year, month) %>%
filter(n() > 1)## # A tibble: 0 x 11
## # Groups: lat, long, year, month [0]
## # ... with 11 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
## # surftemp <dbl>, temp <dbl>, pressure <dbl>, ozone <dbl>,
## # cloudlow <dbl>, cloudmid <dbl>, cloudhigh <dbl>
fueleconomy::vehicles:
The primary key for fueleconomy::vehicles is id.
fueleconomy::vehicles %>%
count(id) %>%
filter(n > 1)## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>
ggplot2::diamonds:
ggplot2::diamonds does not appear to have a primary key.
ggplot2::diamonds %>%
group_by(carat, cut, color, clarity, depth, table, price, x, y, z) %>%
filter(n() > 1) %>%
nrow()## [1] 289
- Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, and AwardsManagers.
- How would you characterise the relationship between the Batting, Pitching, and Fielding tables?
13.4.6 Exercises
- Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:
airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
  borders("state") +
  geom_point() +
  coord_quickmap()
(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.) You might want to use the size or colour of the points to display the average delay for each airport.
dest_delay <- flights %>%
group_by(dest) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE))
dest_delay %>%
left_join(airports, c("dest" = "faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(colour = avg_delay)) +
coord_quickmap()## Warning: Removed 4 rows containing missing values (geom_point).
- Add the location of the origin and destination (i.e. the lat and lon) to flights.
flights %>%
left_join(airports, c("origin" = "faa")) %>%
left_join(airports, c("dest" = "faa")) %>%
rename(lat_origin = lat.x, lon_origin = lon.x) %>%
rename(lat_dest = lat.y, lon_dest = lon.y) %>%
select(-contains("."))## # A tibble: 336,776 x 23
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 542 540 2. 923
## 4 2013 1 1 544 545 -1. 1004
## 5 2013 1 1 554 600 -6. 812
## 6 2013 1 1 554 558 -4. 740
## 7 2013 1 1 555 600 -5. 913
## 8 2013 1 1 557 600 -3. 709
## 9 2013 1 1 557 600 -3. 838
## 10 2013 1 1 558 600 -2. 753
## # ... with 336,766 more rows, and 16 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, lat_origin <dbl>, lon_origin <dbl>,
## # lat_dest <dbl>, lon_dest <dbl>
- Is there a relationship between the age of a plane and its delays?
planes_avg_delay <- planes %>%
select(tailnum, year) %>%
left_join(flights, "tailnum") %>%
group_by(tailnum, year.x) %>%
rename(year = year.x) %>%
summarise(avg_delay = mean(arr_delay, na.rm = TRUE))
ggplot(planes_avg_delay, aes(year, avg_delay)) +
geom_point() +
geom_smooth()## `geom_smooth()` using method = 'gam'
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 76 rows containing missing values (geom_point).
There does not appear to be a clear relationship between plane age and average delay.
- What weather conditions make it more likely to see a delay?
I think rain and low visibility would increase the likelihood of delay.
flights_weather <- flights %>%
inner_join(weather, c("origin", "year", "month", "day", "hour"))
flights_weather %>%
group_by(visib) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(visib, avg_delay)) +
geom_point() +
geom_line() +
  geom_smooth(method = lm, se = FALSE)
As visibility increases, the average delay decreases.
flights_weather %>%
group_by(precip) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot(aes(precip, avg_delay)) +
geom_point() +
geom_line() +
  geom_smooth(method = lm, se = FALSE)
The relationship between rainfall and average delay is less clear.
- What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
There was a series of large storms in the south-east.
flights %>%
filter(month == 6, day == 13) %>%
group_by(hour) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE))## # A tibble: 19 x 2
## hour avg_delay
## <dbl> <dbl>
## 1 5. -2.83
## 2 6. 3.54
## 3 7. 2.81
## 4 8. 7.91
## 5 9. 23.2
## 6 10. 27.8
## 7 11. 35.6
## 8 12. 61.3
## 9 13. 51.8
## 10 14. 61.5
## 11 15. 59.2
## 12 16. 52.0
## 13 17. 63.8
## 14 18. 70.0
## 15 19. 92.9
## 16 20. 87.9
## 17 21. 83.3
## 18 22. 103.
## 19 23. 25.3
By mid-morning there were significant delays. This continued into the evening.
flights %>%
filter(month == 6, day == 13) %>%
group_by(dest) %>%
summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
inner_join(airports, by = c("dest" = "faa")) %>%
ggplot(aes(lon, lat)) +
borders("state") +
geom_point(aes(colour = avg_delay, size = avg_delay)) +
coord_quickmap()## Warning: Removed 3 rows containing missing values (geom_point).
13.5.1 Exercises
- What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)
flights %>%
anti_join(planes, by = "tailnum")## # A tibble: 52,606 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2. 753
## 2 2013 1 1 559 600 -1. 941
## 3 2013 1 1 600 600 0. 837
## 4 2013 1 1 602 605 -3. 821
## 5 2013 1 1 608 600 8. 807
## 6 2013 1 1 611 600 11. 945
## 7 2013 1 1 623 610 13. 920
## 8 2013 1 1 624 630 -6. 840
## 9 2013 1 1 628 630 -2. 1137
## 10 2013 1 1 629 630 -1. 824
## # ... with 52,596 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Many of the flights without a tail number in planes appear to be American Airlines (AA) or Envoy Air (MQ).
flights %>%
anti_join(planes, by = "tailnum") %>%
group_by(carrier) %>%
summarise(count = n())## # A tibble: 10 x 2
## carrier count
## <chr> <int>
## 1 9E 1044
## 2 AA 22558
## 3 B6 830
## 4 DL 110
## 5 F9 50
## 6 FL 187
## 7 MQ 25397
## 8 UA 1693
## 9 US 699
## 10 WN 38
flights %>%
anti_join(planes, by = "tailnum") %>%
nrow()## [1] 52606
Together, these two carriers account for about 90% of the 52,606 unmatched flights.
- Filter flights to only show flights with planes that have flown at least 100 flights.
flights_100 <- flights %>%
group_by(tailnum) %>%
count() %>%
  filter(n >= 100, !is.na(tailnum)) %>%
arrange(desc(n))
flights %>%
semi_join(flights_100, by = "tailnum")## # A tibble: 228,390 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2. 830
## 2 2013 1 1 533 529 4. 850
## 3 2013 1 1 544 545 -1. 1004
## 4 2013 1 1 554 558 -4. 740
## 5 2013 1 1 555 600 -5. 913
## 6 2013 1 1 557 600 -3. 709
## 7 2013 1 1 557 600 -3. 838
## 8 2013 1 1 558 600 -2. 849
## 9 2013 1 1 558 600 -2. 853
## 10 2013 1 1 558 600 -2. 923
## # ... with 228,380 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
- Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.
fueleconomy::vehicles %>%
semi_join(fueleconomy::common, by = c("make", "model"))## # A tibble: 14,531 x 12
## id make model year class trans drive cyl displ fuel hwy cty
## <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
## 1 1833 Acura Inte… 1986 Subc… Auto… Fron… 4 1.60 Regu… 28 22
## 2 1834 Acura Inte… 1986 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 3 3037 Acura Inte… 1987 Subc… Auto… Fron… 4 1.60 Regu… 28 22
## 4 3038 Acura Inte… 1987 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 5 4183 Acura Inte… 1988 Subc… Auto… Fron… 4 1.60 Regu… 27 22
## 6 4184 Acura Inte… 1988 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 7 5303 Acura Inte… 1989 Subc… Auto… Fron… 4 1.60 Regu… 27 22
## 8 5304 Acura Inte… 1989 Subc… Manu… Fron… 4 1.60 Regu… 28 23
## 9 6442 Acura Inte… 1990 Subc… Auto… Fron… 4 1.80 Regu… 24 20
## 10 6443 Acura Inte… 1990 Subc… Manu… Fron… 4 1.80 Regu… 26 21
## # ... with 14,521 more rows
- Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?
- What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?
anti_join(flights, airports, by = c("dest" = "faa"))## # A tibble: 7,602 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 544 545 -1. 1004
## 2 2013 1 1 615 615 0. 1039
## 3 2013 1 1 628 630 -2. 1137
## 4 2013 1 1 701 700 1. 1123
## 5 2013 1 1 711 715 -4. 1151
## 6 2013 1 1 820 820 0. 1254
## 7 2013 1 1 820 820 0. 1249
## 8 2013 1 1 840 845 -5. 1311
## 9 2013 1 1 909 810 59. 1331
## 10 2013 1 1 913 918 -5. 1346
## # ... with 7,592 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
The destination of these flights is unknown (i.e., the airport does not appear in airports). They could be international flights.
anti_join(airports, flights, by = c("faa" = "dest"))## # A tibble: 1,357 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5. A America/New_…
## 2 06A Moton Field Municip… 32.5 -85.7 264 -6. A America/Chic…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6. A America/Chic…
## 4 06N Randall Airport 41.4 -74.4 523 -5. A America/New_…
## 5 09J Jekyll Island Airpo… 31.1 -81.4 11 -5. A America/New_…
## 6 0A9 Elizabethton Munici… 36.4 -82.2 1593 -5. A America/New_…
## 7 0G6 Williams County Air… 41.5 -84.5 730 -5. A America/New_…
## 8 0G7 Finger Lakes Region… 42.9 -76.8 492 -5. A America/New_…
## 9 0P2 Shoestring Aviation… 39.8 -76.6 1000 -5. U America/New_…
## 10 0S9 Jefferson County In… 48.1 -123. 108 -8. A America/Los_…
## # ... with 1,347 more rows
These are the airports that no flight from NYC flew to in 2013 (i.e., airports in airports that never appear as a dest in flights).
- You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.
flights_carrier <- flights %>%
group_by(tailnum, carrier) %>%
count() %>%
arrange(tailnum)
flights_carrier## # A tibble: 4,067 x 3
## # Groups: tailnum, carrier [4,067]
## tailnum carrier n
## <chr> <chr> <int>
## 1 D942DN DL 4
## 2 N0EGMQ MQ 371
## 3 N10156 EV 153
## 4 N102UW US 48
## 5 N103US US 46
## 6 N104UW US 47
## 7 N10575 EV 289
## 8 N105UW US 45
## 9 N107US US 41
## 10 N108UW US 60
## # ... with 4,057 more rows
flights_carrier %>%
group_by(tailnum) %>%
count() %>%
filter(nn > 1)## # A tibble: 18 x 2
## # Groups: tailnum [18]
## tailnum nn
## <chr> <int>
## 1 N146PQ 2
## 2 N153PQ 2
## 3 N176PQ 2
## 4 N181PQ 2
## 5 N197PQ 2
## 6 N200PQ 2
## 7 N228PQ 2
## 8 N232PQ 2
## 9 N933AT 2
## 10 N935AT 2
## 11 N977AT 2
## 12 N978AT 2
## 13 N979AT 2
## 14 N981AT 2
## 15 N989AT 2
## 16 N990AT 2
## 17 N994AT 2
## 18 <NA> 7
There are a number of planes that have been flown by multiple airlines. However, the majority of planes have had a single airline.
14. Strings
To come…
15. Factors
15.3.1 Exercises
- Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
summary(gss_cat$rincome)## No answer Don't know Refused $25000 or more $20000 - 24999
## 183 267 975 7363 1283
## $15000 - 19999 $10000 - 14999 $8000 to 9999 $7000 to 7999 $6000 to 6999
## 1048 1168 340 188 215
## $5000 to 5999 $4000 to 4999 $3000 to 3999 $1000 to 2999 Lt $1000
## 227 226 276 395 286
## Not applicable
## 7043
ggplot(gss_cat, aes(rincome)) +
  geom_bar()
This plot can be improved by rotating the labels on the x-axis:
ggplot(gss_cat, aes(rincome)) +
geom_bar() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))
It would also be better if the categories on the x-axis were ordered from smallest to largest. To do this, you would have to modify the order of the factor levels. The following categories can also be combined: “No answer”, “Don’t know”, “Refused”, and “Not applicable”.
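One possible reordering, sketched with forcats (reversing the levels puts the income bands in roughly ascending order; the non-response categories could additionally be collapsed with fct_collapse()):

```r
library(ggplot2)
library(forcats)

# fct_rev() reverses the factor's level order before plotting
ggplot(gss_cat, aes(fct_rev(rincome))) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))
```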
- What is the most common relig in this survey? What’s the most common partyid?
gss_cat %>%
count(relig) %>%
arrange(desc(n)) %>%
head(1)## # A tibble: 1 x 2
## relig n
## <fct> <int>
## 1 Protestant 10846
gss_cat %>%
count(partyid) %>%
arrange(desc(n)) %>%
head(1)## # A tibble: 1 x 2
## partyid n
## <fct> <int>
## 1 Independent 4119
“Protestant” is the most common religion.
The most common party affiliation is “Independent”.
- Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
levels(gss_cat$denom)## [1] "No answer" "Don't know" "No denomination"
## [4] "Other" "Episcopal" "Presbyterian-dk wh"
## [7] "Presbyterian, merged" "Other presbyterian" "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which" "Evangelical luth"
## [13] "Other lutheran" "Wi evan luth synod" "Lutheran-mo synod"
## [16] "Luth ch in america" "Am lutheran" "Methodist-dk which"
## [19] "Other methodist" "United methodist" "Afr meth ep zion"
## [22] "Afr meth episcopal" "Baptist-dk which" "Other baptists"
## [25] "Southern baptist" "Nat bapt conv usa" "Nat bapt conv of am"
## [28] "Am bapt ch in usa" "Am baptist asso" "Not applicable"
no_denom <- gss_cat %>%
  filter(!denom %in% c("No answer", "Don't know", "No denomination", "Not applicable", "Refused"))
no_denom %>%
count(relig)## # A tibble: 1 x 2
## relig n
## <fct> <int>
## 1 Protestant 9559
ggplot(no_denom, aes(relig, denom)) +
geom_count(aes(colour = ..n.., size = ..n..)) +
  guides(colour = 'legend')
Denomination applies to “Protestant”.
15.4.1 Exercises
- There are some suspiciously high numbers in tvhours. Is the mean a good summary?
summary(gss_cat$tvhours)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 2.000 2.981 4.000 24.000 10146
ggplot(gss_cat, aes(tvhours)) +
geom_bar()## Warning: Removed 10146 rows containing non-finite values (stat_count).
gss_cat %>%
count(tvhours) %>%
arrange(desc(tvhours))## # A tibble: 25 x 2
## tvhours n
## <int> <int>
## 1 24 22
## 2 23 1
## 3 22 2
## 4 21 2
## 5 20 14
## 6 18 7
## 7 17 2
## 8 16 10
## 9 15 17
## 10 14 24
## # ... with 15 more rows
It is unlikely that people are watching 24 hours of TV per day. As seen in the bar plot, tvhours is skewed to the right due to the presence of these extreme values. The median would be a better summary of the data because, unlike the mean, it is largely unaffected by outliers.
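Comparing the two summaries directly (values as reported by summary() above):

```r
library(forcats)  # provides gss_cat

# The median resists the extreme tvhours values; the mean is pulled upward
median(gss_cat$tvhours, na.rm = TRUE)  # 2
mean(gss_cat$tvhours, na.rm = TRUE)    # about 2.98
```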
- For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
glimpse(gss_cat)## Observations: 21,483
## Variables: 9
## $ year <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
## $ marital <fct> Never married, Divorced, Widowed, Never married, Divor...
## $ age <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51...
## $ race <fct> White, White, White, White, White, White, White, White...
## $ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not appl...
## $ partyid <fct> Ind,near rep, Not str republican, Independent, Ind,nea...
## $ relig <fct> Protestant, Protestant, Protestant, Orthodox-christian...
## $ denom <fct> Southern baptist, Baptist-dk which, No denomination, N...
## $ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, ...
levels(gss_cat$marital)## [1] "No answer" "Never married" "Separated" "Divorced"
## [5] "Widowed" "Married"
levels(gss_cat$rincome)## [1] "No answer" "Don't know" "Refused" "$25000 or more"
## [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999"
## [9] "$7000 to 7999" "$6000 to 6999" "$5000 to 5999" "$4000 to 4999"
## [13] "$3000 to 3999" "$1000 to 2999" "Lt $1000" "Not applicable"
marital, race, partyid, relig, and denom are arbitrary (i.e., nominal). rincome is principled (i.e., ordinal).
- Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
rincome_summary <- gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
Because fct_relevel() moves “Not applicable” to the first position in the level order (integer value 1), and ggplot2 draws the first level at the bottom of a discrete y-axis:
rincome_levels <- fct_relevel(rincome_summary$rincome, "Not applicable")
levels(rincome_levels)## [1] "Not applicable" "No answer" "Don't know" "Refused"
## [5] "$25000 or more" "$20000 - 24999" "$15000 - 19999" "$10000 - 14999"
## [9] "$8000 to 9999" "$7000 to 7999" "$6000 to 6999" "$5000 to 5999"
## [13] "$4000 to 4999" "$3000 to 3999" "$1000 to 2999" "Lt $1000"
15.5.1 Exercises
- How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
partyid_clean <- gss_cat %>%
mutate(partyid = fct_collapse(
partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
))
partyid_counts <- partyid_clean %>%
count(year, partyid) %>%
group_by(year) %>%
mutate(prop = n / sum(n))
ggplot(partyid_counts, aes(x = year, y = prop, colour = fct_reorder2(
partyid, year, prop
))) +
geom_point() +
geom_line() +
  labs(colour = "partyid")
There appear to be fewer people identifying as Republican over time. In comparison, the proportion of people identifying as Independent appears to be increasing, while the proportion of Democrats appears about the same.
There is also a slight increase in the number of people identifying as Other since around 2005.
- How could you collapse rincome into a small set of categories?
gss_cat %>%
count(rincome)## # A tibble: 16 x 2
## rincome n
## <fct> <int>
## 1 No answer 183
## 2 Don't know 267
## 3 Refused 975
## 4 $25000 or more 7363
## 5 $20000 - 24999 1283
## 6 $15000 - 19999 1048
## 7 $10000 - 14999 1168
## 8 $8000 to 9999 340
## 9 $7000 to 7999 188
## 10 $6000 to 6999 215
## 11 $5000 to 5999 227
## 12 $4000 to 4999 226
## 13 $3000 to 3999 276
## 14 $1000 to 2999 395
## 15 Lt $1000 286
## 16 Not applicable 7043
gss_cat %>%
mutate(rincome = fct_lump(rincome, n = 5)) %>%
count(rincome, sort = TRUE) %>%
mutate(rincome = fct_recode(
rincome,
"Less than $10000" = "Other"
))## # A tibble: 6 x 2
## rincome n
## <fct> <int>
## 1 $25000 or more 7363
## 2 Not applicable 7043
## 3 Less than $10000 3578
## 4 $20000 - 24999 1283
## 5 $10000 - 14999 1168
## 6 $15000 - 19999 1048
16. Dates and times
16.2.4 Exercises
What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
The invalid entries are parsed as NA, and a warning is issued.
- What does the tzone argument to today() do? Why is it important?
tzone specifies the time zone in which to determine the current date. This matters because, at any given moment, the current date can differ between time zones.
Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
mdy(d1)## [1] "2010-01-01"
ymd(d2)## [1] "2015-03-07"
dmy(d3)## [1] "2017-06-06"
mdy(d4)## [1] "2015-08-19" "2015-07-01"
mdy(d5)## [1] "2014-12-30"
16.3.4 Exercises
- How does the distribution of flight times within a day change over the course of the year?
flights_dt %>%
mutate(hour = hour(dep_time), month = factor(month(dep_time))) %>%
group_by(hour, month) %>%
summarise(
n = n()
) %>%
ggplot(aes(x = hour, y = n, colour = month)) +
  geom_line()
The within-day distribution of departures looks broadly similar across the months.
- Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.
- Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)
- How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
- On what day of the week should you leave if you want to minimise the chance of a delay?
flights_dt %>%
mutate(day = wday(sched_dep_time, label = TRUE)) %>%
group_by(day) %>%
summarise(dep_delay = mean(dep_delay)) %>%
ggplot(aes(day, dep_delay)) +
  geom_col()
Saturday.
- What makes the distribution of diamonds$carat and flights$sched_dep_time similar?
- Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.
19. Functions
19.2.1 Practice
- Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?
Because it is not an input that varies: the function takes only one input (x). na.rm = TRUE is hard-coded so that missing values are always removed before the range is computed.
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5))## [1] 0 1
So far so good.
rescale01 <- function(x) {
rng <- range(x, na.rm = FALSE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, NA))## [1] NA NA NA
When na.rm is FALSE, NA values are contagious.
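With na.rm = TRUE, the missing value no longer poisons the range, and only the NA position stays NA:

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)  # drops the NA before computing the range
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, NA))
# 0 1 NA
```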
- In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1.
x <- c(1:10, Inf)
rescale01_remapped <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
y <- (x - rng[1]) / (rng[2] - rng[1])
y[y == -Inf] <- 0
y[y == Inf] <- 1
y
}
rescale01_remapped(x)## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000 1.0000000
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))
x / sum(x, na.rm = TRUE)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
mean() calculates the mean of an object. is.na() returns a logical vector indicating, element by element, whether a value is missing.
x <- c(1, 5, 10, 25, 25)
mean(x)## [1] 13.2
Therefore, mean(is.na(x)) calculates the mean of a logical vector, where TRUE is equal to 1, and FALSE is equal to 0.
So, this function calculates the proportion of missing values in an object. It can be turned into a function like this:
prop_NA <- function(x){
mean(is.na(x))
}
x <- c(1, 10, NA, NA)
prop_NA(x)## [1] 0.5
x <- c(1, 5, 10, 25, 25)
x / sum(x, na.rm = TRUE)## [1] 0.01515152 0.07575758 0.15151515 0.37878788 0.37878788
The code above calculates the proportion of each value in a numeric vector by dividing each value by the total sum of the vector. It also removes any missing values prior to calculation.
prop <- function(x){
x / sum(x, na.rm = TRUE)
}
prop(x)## [1] 0.01515152 0.07575758 0.15151515 0.37878788 0.37878788
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)## [1] 0.8510513
Finally, the last snippet of code divides the standard deviation of a numeric vector by the mean. According to Wikipedia, this is known as the coefficient of variation:
… the coefficient of variation, also known as relative standard deviation, is a standardized measure of dispersion of a probability distribution or frequency distribution.
It can be turned into a function like this:
coeff_var <- function(x){
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}
coeff_var(x)## [1] 0.8510513
- Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
# Variance
x <- sample(1:100, 100, replace = TRUE)
sample_var <- function(x){
n <- length(x)
m <- mean(x)
(1 / (n - 1)) * sum((x - m) ^ 2)
}
sample_var(x)## [1] 877.0461
# Computes the same result as the built-in function:
var(x)## [1] 877.0461
# Skew
x <- sample(1:100, 100, replace = TRUE)
skew <- function(x){
n <- length(x)
m <- mean(x)
s <- sd(x)
(1 / n) * sum((x - m) ^ 3) / s ^ 3
}
skew(x)
Note the parentheses: sum((x - m) ^ 3) sums the cubed deviations. Writing sum(x - m) ^ 3 instead cubes a sum that is always essentially zero, which is why the original version returned a value on the order of 1e-46. Naming the intermediate s also avoids shadowing stats::sd().
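As a rough sanity check of the skewness statistic (a self-contained copy, with the deviations cubed element-wise before summing), a right-skewed sample should give a clearly positive value while a symmetric one should be near zero:

```r
sample_skew <- function(x) {
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  (1 / n) * sum((x - m) ^ 3) / s ^ 3
}
set.seed(1)
sample_skew(rexp(1e4))   # right-skewed exponential: clearly positive
sample_skew(rnorm(1e4))  # symmetric normal: close to zero
```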
- Write
both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
x <- c(1, 4, 2, NA, NA)
y <- c(3, 5, NA, 1, NA)
both_na <- function(x, y){
sum(is.na(x) & is.na(y))
}
both_na(x, y)## [1] 1
- What do the following functions do? Why are they useful even though they are so short?
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
is_directory("~/Google Drive/R/r4ds-solutions/data")## [1] TRUE
is_readable("index.html")## index.html
## TRUE
is_directory() can determine whether a file path is a directory. is_readable() tests whether a particular file is readable or not. These functions are helpful because the function names make it more readily apparent what each function does without the user having to look through the code.
- Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
foo_foo_mischief <- function(){
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mouse) %>%
bop(on = head)
}
foo_foo_mischief()
good_fairy %>%
says("Little bunny Foo Foo
I don't want to see you
Scooping up the field mice
And bopping them on the head.
I'll give you three chances,
And if you don't stop, I'll turn you into a GOON!")
19.3.1 Exercises
Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
The first function checks whether a string has a given prefix.
f1("hello world", "hello")## [1] TRUE
substr() extracts substrings from a character vector by starting from the first element (1), and by using the number of characters in the prefix as the last/stop value. Then, it evaluates whether the substring is equal to that of the prefix.
substr("hello world", 1, nchar("hello"))## [1] "hello"
"hello" == "hello"## [1] TRUE
A better name for this function might be:
check_prefix <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
The second function removes the last element of a vector, returning NULL when the vector has one element or fewer. The vector doesn't have to be numeric. For example:
remove_last <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
x <- sample(1:10, 10, replace = TRUE)
y <- c("hello", "world")
remove_last(x)## [1] 2 5 3 9 8 10 6 5 5
remove_last(y)## [1] "hello"
The last function replicates the values in y based on the length of x.
repeat_values <- function(x, y) {
rep(y, length.out = length(x))
}
x <- c(1, 3, 5, 3)
y <- c("hello", "world")
repeat_values(x, y)## [1] "hello" "world" "hello" "world"
- Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
df <- tibble(
a = c("47%", "23%", "97%", "88%", "12%")
)
remove_percentage <- function(column) {
parse_double(gsub("[%]", "", column))
}
df$a <- remove_percentage(df$a)
df## # A tibble: 5 x 1
## a
## <dbl>
## 1 47.
## 2 23.
## 3 97.
## 4 88.
## 5 12.
- Compare and contrast
rnorm() and MASS::mvrnorm(). How could you make them more consistent?
rnorm() produces random numbers from a normal distribution with (by default) a mean of 0 and a standard deviation of 1. In comparison, MASS::mvrnorm() produces samples from a multivariate normal distribution. To make them more consistent, they should use the same argument names: MASS::mvrnorm() uses mu and Sigma, whereas rnorm() uses mean and sd. Consistent names would make the pair easier to use and remember.
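The naming inconsistency is easy to see side by side (MASS ships with standard R installations, but loading it via :: avoids masking other functions):

```r
set.seed(42)
# Univariate sampler: arguments are named `mean` and `sd`
rnorm(3, mean = 0, sd = 1)
# Multivariate sampler: the analogous arguments are `mu` and `Sigma`
MASS::mvrnorm(3, mu = c(0, 0), Sigma = diag(2))
```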
- Make a case for why
norm_r(),norm_d()etc would be better thanrnorm(),dnorm(). Make a case for the opposite.
Using norm_r(), norm_d(), etc. would group together all of the functions related to the normal distribution, so autocomplete would surface them under a common norm_ prefix. Conversely, rnorm(), dnorm(), etc. group the functions by action: every r* function samples from a distribution and every d* function evaluates a density, so once you know the pattern for one distribution you know it for all of them.
19.4.4 Exercises
- What’s the difference between
if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.
ifelse() returns a vector of the same length as test filled with elements from the yes or no argument, depending on whether the test is TRUE or FALSE.
y <- sample(1:10, 10, replace = TRUE)
x <- ifelse(y < 5, "Too low", "Too high")
x## [1] "Too low" "Too high" "Too low" "Too low" "Too high" "Too low"
## [7] "Too low" "Too low" "Too high" "Too low"
In comparison, if will return one value only because the condition only works with a length-one logical vector.
y <- sample(1:10, 10, replace = TRUE)
x <- if (y < 20) "Too low" else "Too high"## Warning in if (y < 20) "Too low" else "Too high": the condition has length
## > 1 and only the first element will be used
x## [1] "Too low"
y <- 10
x <- if (y < 20) "Too low" else "Too high"
x## [1] "Too low"
- Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to
lubridate::now(). That will make it easier to test your function.)
greeting <- function(time = lubridate::now()) {
hour <- hour(time)
if (hour >= 0 && hour < 12) {
"good morning"
} else if (hour >= 12 && hour < 18) {
"good afternoon"
} else {
"good evening"
}
}
greeting()## [1] "good morning"
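The time argument is what makes the function testable: you can pass fixed times instead of relying on when the code happens to run. A base-R sketch (avoiding the lubridate dependency; hours use a half-open [start, end) convention):

```r
greeting <- function(time = Sys.time()) {
  h <- as.POSIXlt(time)$hour  # hour of day, 0-23
  if (h < 12) {
    "good morning"
  } else if (h < 18) {
    "good afternoon"
  } else {
    "good evening"
  }
}
greeting(as.POSIXct("2024-01-01 09:00:00"))  # "good morning"
greeting(as.POSIXct("2024-01-01 15:00:00"))  # "good afternoon"
greeting(as.POSIXct("2024-01-01 23:00:00"))  # "good evening"
```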
- Implement a
fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns "fizz". If it's divisible by five it returns "buzz". If it's divisible by three and five, it returns "fizzbuzz". Otherwise, it returns the number. Make sure you first write working code before you create the function.
fizzbuzz <- function(x) {
if (x %% 3 == 0 && x %% 5 == 0) {
"fizzbuzz"
} else if (x %% 3 == 0) {
"fizz"
} else if (x %% 5 == 0) {
"buzz"
} else {
x
}
}
# Test function
fizzbuzz(15)## [1] "fizzbuzz"
fizzbuzz(3)## [1] "fizz"
fizzbuzz(5)## [1] "buzz"
fizzbuzz(2)## [1] 2
How could you use cut() to simplify this set of nested if-else statements?
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
How would you change the call to cut() if I'd used < instead of <=? What is the other chief advantage of cut() for this problem? (Hint: what happens if you have many values in temp?)
cut() divides the values of a numeric vector into intervals and recodes these values according to which interval they fall.
x <- 1:10
cut(x, breaks = c(1, 6, 10), include.lowest = TRUE, right = FALSE)## [1] [1,6) [1,6) [1,6) [1,6) [1,6) [6,10] [6,10] [6,10] [6,10] [6,10]
## Levels: [1,6) [6,10]
To simplify the example, you can replace the nested if-else statements with cut():
temp <- seq(-5, 40, by = 5)
cut(temp, breaks = c(-Inf, 0, 10, 20, 30, Inf), right = TRUE, labels = c("freezing", "cold", "cool", "warm", "hot"))## [1] freezing freezing cold cold cool cool warm
## [8] warm hot hot
## Levels: freezing cold cool warm hot
If < was used instead of <=, you would have to change the right argument in cut():
temp <- seq(-5, 40, by = 5)
cut(temp, breaks = c(-Inf, 0, 10, 20, 30, Inf), right = FALSE, labels = c("freezing", "cold", "cool", "warm", "hot"))## [1] freezing cold cold cool cool warm warm
## [8] hot hot hot
## Levels: freezing cold cool warm hot
The main advantage of using cut() for this problem is that it takes a vector of multiple values, not just one (as is the case with an if-else statement).
- What happens if you use
switch() with numeric values?
The first argument of switch (EXPR) determines which element of ... is returned.
switch(EXPR = 2, 5, 6, 7, 8)## [1] 6
In this example, EXPR = 2, so the second element from ... (6) is returned.
What does this switch() call do? What happens if x is "e"?
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
Experiment, then carefully read the documentation.
It takes EXPR (x) and looks for an element of ... whose name exactly matches it. When the matched element has no value (e.g. a = ,), switch() falls through to the next non-empty element, so both "a" and "b" return "ab", and both "c" and "d" return "cd".
x <- "a"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)## [1] "ab"
x <- "e"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
If x is "e", nothing matches and there is no default value, so switch() invisibly returns NULL.
19.5.5 Exercises
- What does
commas(letters, collapse = "-") do? Why?
It throws an error rather than collapsing anything. commas() is defined as function(...) stringr::str_c(..., collapse = ", "), so the collapse = "-" you supply is captured by ... and forwarded to str_c(), which then receives two collapse arguments and fails with "formal argument 'collapse' matched by multiple actual arguments".
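Reproducing the call makes the behaviour concrete (commas() here is the chapter's definition, built on stringr::str_c(); the extra collapse is swept up by ... and collides with the hard-coded one):

```r
library(stringr)
commas <- function(...) str_c(..., collapse = ", ")
commas(letters[1:3])
#> [1] "a, b, c"
# The user's collapse = "-" is forwarded through ..., so str_c()
# sees collapse twice and errors:
try(commas(letters[1:3], collapse = "-"))
```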
- It’d be nice if you could supply multiple characters to the
pad argument, e.g. rule("Title", pad = "-+"). Why doesn't this currently work? How could you fix it?
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Title", pad = "-+")## Title -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
It runs, but the rule is roughly twice as long as intended: width counts characters, while str_dup() repeats the whole two-character pad width times. To fix it, repeat the pad only enough times to cover width and then trim the result to exactly width characters.
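The output above is actually about twice the console width, because str_dup() repeats the entire two-character pad width times. A sketch of a fix (assuming stringr): repeat the pad enough times to cover the width, then cut to size with str_sub():

```r
library(stringr)
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  # Repeat the pad ceiling(width / nchar(pad)) times, then trim to `width`
  padding <- str_sub(str_dup(pad, ceiling(width / nchar(pad))), 1, width)
  cat(title, " ", padding, "\n", sep = "")
}
rule("Title", pad = "-+")
```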
- What does the
trim argument to mean() do? When might you use it?
trim removes the given fraction of observations from each end of (sorted) x before computing the mean. You might use it when extreme outliers would otherwise distort the result.
x <- sample(1:10, 10, replace = TRUE)
mean(x)## [1] 5.3
mean(x, trim = 0.1)## [1] 5.25
- The default value for the
method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?
These are the methods cor() will accept for computing a correlation; match.arg() validates the supplied string against this vector and rejects anything else. When method is not specified, match.arg() selects the first element, so "pearson" is the default.
random_sample <- function() {
sample(1:100, 50, replace = TRUE)
}
x <- random_sample()
y <- random_sample()
cor(x, y, method = "spearman")## [1] 0.04182181
cor(x, y, method = "caveman") # throws an error## Error in match.arg(method): 'arg' should be one of "pearson", "kendall", "spearman"
21. Iteration
21.2.1 Exercises
Write for loops to:
- Compute the mean of every column in mtcars.
- Determine the type of each column in nycflights13::flights.
- Compute the number of unique values in each column of iris.
- Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\).
Think about the output, sequence, and body before you start writing the loop.
Compute the mean of every column in mtcars:
col_means <- vector("double", ncol(mtcars))
names(col_means) <- names(mtcars)
for (i in seq_along(mtcars)) {
col_means[[i]] <- mean(mtcars[[i]])
}
col_means## mpg cyl disp hp drat wt
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## qsec vs am gear carb
## 17.848750 0.437500 0.406250 3.687500 2.812500
Determine the type of each column in nycflights13::flights:
col_type <- vector("list", ncol(flights))
names(col_type) <- names(flights)
for (i in seq_along(flights)) {
col_type[[i]] <- class(flights[[i]])
}
col_type## $year
## [1] "integer"
##
## $month
## [1] "integer"
##
## $day
## [1] "integer"
##
## $dep_time
## [1] "integer"
##
## $sched_dep_time
## [1] "integer"
##
## $dep_delay
## [1] "numeric"
##
## $arr_time
## [1] "integer"
##
## $sched_arr_time
## [1] "integer"
##
## $arr_delay
## [1] "numeric"
##
## $carrier
## [1] "character"
##
## $flight
## [1] "integer"
##
## $tailnum
## [1] "character"
##
## $origin
## [1] "character"
##
## $dest
## [1] "character"
##
## $air_time
## [1] "numeric"
##
## $distance
## [1] "numeric"
##
## $hour
## [1] "numeric"
##
## $minute
## [1] "numeric"
##
## $time_hour
## [1] "POSIXct" "POSIXt"
Compute the number of unique values in each column of iris:
unique_values <- vector("integer", ncol(iris))
names(unique_values) <- names(iris)
for (i in seq_along(iris)) {
unique_values[[i]] <- n_distinct(iris[[i]])
}
unique_values## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 35 23 43 22 3
# Check the above results
fct_unique(iris$Species)## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
Generate 10 random normals for each of \(\mu = -10\), \(0\), \(10\), and \(100\):
mu <- c(-10, 0, 10, 100)
random_normals <- vector("list", length(mu))
for (i in seq_along(random_normals)) {
random_normals[[i]] <- rnorm(10, mean = mu[[i]])
}
random_normals## [[1]]
## [1] -9.668042 -12.752640 -11.666053 -11.786233 -11.171829 -10.472993
## [7] -10.390668 -9.098248 -9.370741 -10.759146
##
## [[2]]
## [1] -1.57981451 -0.69247715 -0.32280781 -0.41258331 -0.42207960
## [6] 1.32236125 0.03267892 1.81945278 -0.79630745 -1.90735874
##
## [[3]]
## [1] 10.305425 9.566745 10.780071 10.996345 10.745244 11.101034 9.884776
## [8] 10.639426 9.819239 10.231762
##
## [[4]]
## [1] 100.85902 99.03889 99.76429 100.08287 100.84113 98.01641 98.80659
## [8] 98.78463 100.69881 99.15648
Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:
out <- ""
for (x in letters) {
out <- stringr::str_c(out, x)
}
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
out[i] <- out[i - 1] + x[i]
}
Combine your function writing and for loop skills:
- Write a for loop that prints() the lyrics to the children's song "Alice the camel".
- Convert the nursery rhyme "ten in the bed" to a function. Generalise it to any number of people in any sleeping structure.
- Convert the song "99 bottles of beer on the wall" to a function. Generalise to any number of any vessel containing any liquid on any surface.
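The loop-elimination exercise above isn't answered explicitly; each loop has a vectorised one-liner equivalent in base R (str_c() with collapse would work equally well for the first):

```r
# 1. Building up a string letter by letter -> one collapse call
out <- paste(letters, collapse = "")

# 2. The second loop computes the sample standard deviation by hand
x <- sample(100)
s <- sd(x)

# 3. The third loop computes a running total
x <- runif(100)
running_total <- cumsum(x)
```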
It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:
output <- vector("integer", 0)
for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
}
output
How does this affect performance? Design and execute an experiment.
Growing the output with c() copies the entire vector on every iteration, so run time grows roughly quadratically with input size; a preallocated loop only fills existing slots and stays roughly linear. The gap therefore widens as inputs get larger.
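A simple experiment sketch comparing the two approaches (the size n is arbitrary; increase it to make the gap more dramatic):

```r
grow <- function(n) {
  output <- vector("integer", 0)
  for (i in seq_len(n)) output <- c(output, i)  # copies output each time
  output
}
prealloc <- function(n) {
  output <- vector("integer", n)
  for (i in seq_len(n)) output[[i]] <- i        # fills an existing slot
  output
}
n <- 5e4
system.time(grow(n))      # noticeably slower...
system.time(prealloc(n))  # ...than this
```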
27. R Markdown
27.2.1 Exercises
Create a new notebook using File > New File > R Notebook. Read the instructions. Practice running the chunks. Verify that you can modify the code, re-run it, and see modified output.
Create a new R Markdown document with File > New File > R Markdown… Knit it by clicking the appropriate button. Knit it by using the appropriate keyboard short cut. Verify that you can modify the input and see the output update.
Compare and contrast the R notebook and R markdown files you created above. How are the outputs similar? How are they different? How are the inputs similar? How are they different? What happens if you copy the YAML header from one to the other?
- R notebook files are suitable for capturing what you did, what you were thinking, and communicating to other data scientists
- R markdown files are suitable for communicating to decision-makers who don’t care about the code
- An R notebook file allows the user to show/hide code and download the file; an R markdown file does not
- The R markdown file contains additional information in the YAML header (author and date) by default
- When you copy the YAML header from one to the other it changes the type of file, because the output: field determines the format that will be created
- Create one new R Markdown document for each of the three built-in formats: HTML, PDF and Word. Knit each of the three documents. How does the output differ? How does the input differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)
Generally, each output will differ visually. For example, the html output uses a sans-serif font while the pdf output uses a serif font.
27.3.1 Exercises
- Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.
See this repo.
Using the R Markdown quick reference, figure out how to:
- Add a footnote.
- Add a horizontal rule.
- Add a block quote.
Add a footnote:
A footnote[^1]
[^1]: Here is the footnote
(This footnote will be displayed at the bottom of this document.)
Add a horizontal rule:
***
Add a block quote:
> block quote
- Copy and paste the contents of
diamond-sizes.Rmd from https://github.com/hadley/r4ds/tree/master/rmarkdown into a local R Markdown document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.
28. Graphics for communication
28.2.1 Exercises
- Create one plot on the fuel economy data with customised
title, subtitle, caption, x, y, and colour labels.
ggplot(mpg, aes(x = hwy, y = cty)) +
geom_point(aes(colour = class)) +
labs(
title = "Small cars are more fuel efficient",
subtitle = "SUVs have poor fuel economy",
caption = "Data from fueleconomy.gov",
x = "Highway miles per gallon",
y = "City miles per gallon",
colour = "Class"
)
- The geom_smooth() is somewhat misleading because the hwy for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. Use your modelling tools to fit and display a better model.
- Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
geom_point(aes(colour = Species)) +
labs(
title = "Iris virginica and versicolor have large petals",
subtitle = "Iris setosa have substantially smaller petals",
caption = "Source: Edgar Anderson's Iris Data",
x = "Petal length (cm)",
y = "Petal width (cm)",
colour = "Species of iris"
)
28.3.1 Exercises
- Use
geom_text() with infinite positions to place text at the four corners of the plot.
labels <- tibble(
cty = c(Inf, Inf, -Inf, -Inf),
displ = c(Inf, -Inf, Inf, -Inf),
labels = c("Text", "tExt", "teXt", "texT"),
vjust_var = c("top", "top", "bottom", "bottom"),
hjust_var = c("right", "left", "right", "left")
)
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(aes(colour = class)) +
geom_text(data = labels, aes(label = labels, vjust = vjust_var, hjust = hjust_var))
- Read the documentation for
annotate(). How can you use it to add a text label to a plot without having to create a tibble?
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(aes(colour = class)) +
annotate("text", x = 7, y = 35, label = "Text")
You can also easily add multiple labels:
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(aes(colour = class)) +
annotate("text", x = c(7, 1), y = c(35, 35), label = c("Text", "tExt"))
- How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: think about the underlying data.)
- What arguments to geom_label() control the appearance of the background box?
The fill aesthetic. For example:
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(aes(colour = class)) +
geom_label(aes(label = model, fill = class), data = best_in_class)
- What are the four arguments to
arrow()? How do they work? Create a series of plots that demonstrate the most important options.
The four arguments to arrow() are angle, length, ends, and type. angle sets the angle of the arrow head (smaller values give a pointier head), length sets the head's length (a unit object), ends ("last", "first", or "both") chooses which end(s) of the line get a head, and type ("open" or "closed") controls whether the head is drawn as a filled triangle. Arrows are attached via the arrow argument of geoms such as geom_segment().
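A small demonstration via geom_segment()'s arrow argument (the segment coordinates are arbitrary; arrow() and unit() are re-exported by ggplot2 from grid):

```r
library(ggplot2)
df <- data.frame(x = 1, xend = 2, y = 1:2, yend = 1:2)
p <- ggplot(df) +
  geom_segment(aes(x = x, xend = xend, y = y, yend = yend),
               arrow = arrow(angle = 20,                    # narrower head
                             length = unit(0.3, "inches"),  # head size
                             ends = "both",                 # heads at both ends
                             type = "closed"))              # filled triangle
# print(p) to see closed arrow heads on both ends of each segment
```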
Here is the footnote↩